Use Pegasus in Huggingface for a downstream classification task

I have collected a dataset of paragraph-summary pairs, where the summary may or may not correspond to the paragraph it is paired with. I also have labels indicating whether a summary corresponds to its paragraph (1 if it is a corresponding pair, and 0 if it is not).
I would like to use the pretrained Pegasus_large model in Huggingface (off-the-shelf) and train it on this downstream classification task.
Since Pegasus does not have any CLS token, I was thinking of possible ways of doing this.
I want to concatenate the paragraph and summary together, pass the result through the pretrained Pegasus encoder only, and then pool over the final hidden layer outputs of the encoder. If I use the Huggingface PegasusModel (the one without any summary generation head), it expects me to provide decoder_input_ids, which I assume are the true tokens (labels) when Pegasus is trained as a seq2seq model for summary generation. However, since I am not training my model to generate summaries and only want the encoder representation, I am not sure what to pass as decoder_input_ids.
My questions are: 1. Am I right in assuming the decoder_input_ids are only used for training the model for sequence generation, and 2. How should I get the last hidden layer outputs without having any decoder_input_ids?


Embedding layer output is stochastic?

If I use two identical models to learn over a dataset, but the order in which the samples are presented differs, would the embedding layers end up with exactly the same embeddings?
I think you will not get exactly the same embeddings. The embedding parameters depend on the path gradient descent takes, so you will probably get different values when the order of the sample batches differs. Furthermore, the embedding layer starts from a random weight initialization, which can also contribute to a difference.
However, I would expect that two words that are close in one embedding will also be close in the other.
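The effect of sample order can be seen on a toy problem. The following is only a minimal sketch, not a real embedding layer: it assumes a made-up objective (fit the dot product of two embedding rows to a target score) and a hypothetical helper `train_embeddings`. Both runs start from the same initial weights, so any difference comes from the order of the updates alone.

```python
import random


def train_embeddings(samples, init, lr=0.1):
    """One SGD pass fitting dot(E[i], E[j]) to a target score t."""
    emb = [row[:] for row in init]  # copy so every run shares the same start
    for i, j, t in samples:
        err = sum(a * b for a, b in zip(emb[i], emb[j])) - t
        # Gradients of 0.5 * err**2 with respect to rows i and j.
        grad_i = [err * b for b in emb[j]]
        grad_j = [err * a for a in emb[i]]
        emb[i] = [a - lr * g for a, g in zip(emb[i], grad_i)]
        emb[j] = [b - lr * g for b, g in zip(emb[j], grad_j)]
    return emb


rng = random.Random(0)
# Three "words", two dimensions each (toy sizes).
init = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
samples = [(0, 1, 1.0), (1, 2, 0.0), (0, 2, 0.5)]

emb_a = train_embeddings(samples, init)
emb_b = train_embeddings(list(reversed(samples)), init)

max_diff = max(
    abs(a - b) for row_a, row_b in zip(emb_a, emb_b) for a, b in zip(row_a, row_b)
)
print("largest difference between the two runs:", max_diff)
```

Each update depends on the parameters left behind by the previous one, so reordering the samples changes the trajectory even though the data and the starting point are identical.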

Bert for relation extraction

I am working with BERT for relation extraction, framed as binary classification on a TSV file. It is my first time using BERT, so there are some points I need to understand better:
How can I get output for test data that shows the classification results and whether each example was classified correctly or not?
How does BERT extract features from the sentences, and is there a way to find out which features were chosen?
I tried it once using the hidden layers and once without; the accuracy without the hidden layers was higher than with them. Is there a reason for that?

Custom word-embeddings in gensim

I have a word embedding matrix (say M) of size V x N, where V is the size of the vocabulary and N is the size of each word vector. I want the word2vec model of gensim to initialise its word embedding matrix with M during training. I am able to load M in the word2vec format using
but I don't know how to feed M into the gensim word2vec model.
The Gensim Word2Vec model isn't designed to be pre-initialized with outside vectors, so there are no built-in helper methods.
But of course since the source is available, and all objects can be modified by your direct tampering, you can still change/replace its usual (random) initialization with anything of your own choosing, with a little effort.
Specifically, you'd first create a Word2Vec model using the constructor without yet supplying any training corpus. (If you supply a corpus, it will automatically do the next two build_vocab() and train() steps for you, and you don't want that.)
Then, you'd perform the necessary .build_vocab() step, allowing it to survey your training text data to discover its vocabulary with word-frequencies, and perform its usual model initialization, by calling model.build_vocab(corpus).
At this point, before doing any other training, you can tamper with the model to replace its random-initialization of words with your alternate word-vectors, as from the vectors you've loaded. If your other vectors are in the KeyedVectors variable loaded_kv, this could be as simple as:
for word in loaded_kv.index_to_key:
    if word in model.wv:
        model.wv[word] = loaded_kv[word]
Note that if your loaded_kv includes words that aren't in the corpus, or too rare (appear fewer than min_count times) in the corpus, the model will not have allocated a space for those vectors – as they won't be used in training – and they won't be part of the final model.
If for some reason you need them to be, you should ensure a sufficient number of valid usage examples of those words appear inside the corpus. You shouldn't add them to the model in a way that changes the total number of vectors in model.wv after .build_vocab(), because the model is not expecting that sort of change, and errors/undefined-behavior are likely.
(You also shouldn't simply force extra words that aren't in the real training data into the model, because while they will remain unchanged through training, all other words will continue to be adjusted through training – meaning any words that weren't also incrementally adjusted, in an interleaved fashion, along with the rest may wind up essentially "incompatible" with the word-vectors that were co-trained in the same full model.)
After you've modified the model's initialization to match your preferences, continue with normal training, something like:
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

Creating input data for BERT modelling - multiclass text classification

I'm trying to build a Keras model to classify text into 45 different classes. I'm a little confused about how to prepare my data as input for Google's BERT model.
Some blog posts insert data as a tf dataset with input_ids, segment ids, and mask ids, as in this guide, but then some only go with input_ids and masks, as in this guide.
Also in the second guide, it notes that the segment mask and attention mask inputs are optional.
Can anyone explain whether or not those two are required for a multiclass classification task?
If it helps, each row of my data can consist of any number of sentences within a reasonably sized paragraph. I want to be able to classify each paragraph/input to a single label.
I can't seem to find many guides/blogs about using BERT with Keras (Tensorflow 2) for a multiclass problem, indeed many of them are for multi-label problems.
I guess it is too late to answer, but I had the same question. I went through the Huggingface code and found that if attention_mask and token_type_ids (the segment ids) are None, then by default the model attends to all tokens and all segments are given id 0.
If you want to check it out, you can find the code here
Let me know if this clarifies it or you think otherwise.
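To make those defaults concrete, here is a small sketch in plain Python of what the three inputs look like for single-segment classification. The helper `build_bert_inputs`, the token ids, and the padding id of 0 are assumptions for illustration, not part of any library.

```python
def build_bert_inputs(token_id_seqs, pad_id=0):
    """Pad a batch of token-id lists and build the two optional inputs."""
    max_len = max(len(seq) for seq in token_id_seqs)
    input_ids, attention_mask, token_type_ids = [], [], []
    for seq in token_id_seqs:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding positions.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Single-segment input: every position belongs to segment 0.
        token_type_ids.append([0] * max_len)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


# Two made-up sequences of different lengths.
batch = build_bert_inputs([[101, 7592, 2088, 102], [101, 7592, 102]])
for name, rows in batch.items():
    print(name, rows)
```

For a one-paragraph-per-example task the token_type_ids are all zeros, which matches the default. The attention_mask only differs from the all-ones default where padding is present, so it matters once sequences of different lengths are padded into one batch.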

How to adopt multiple different loss functions in each steps of LSTM in Keras

I have a set of sentences and their scores, and I would like to train a marking system that can predict the score for a given sentence. One example looks like this:
(X =Tomorrow is a good day, Y = 0.9)
I would like to use an LSTM to build such a marking system and also take into account the sequential relationship between the words in the sentence, so the training example shown above is transformed as follows:
(x1=Tomorrow, y1=is) (x2=is, y2=a) (x3=a, y3=good) (x4=day, y4=0.9)
When training this LSTM, I would like the first three time steps to use a softmax classifier and the final step to use MSE, so the overall loss is composed of two different loss functions. It seems that Keras does not provide a direct way to do this. In addition, I am not sure whether my approach to building the marking system is correct.
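Keras supports multiple loss functions as well:
model = Model(inputs=inputs,
              outputs=[lang_model, sent_model])
# The optimizer here is a placeholder; any Keras optimizer can be used.
model.compile(optimizer='adam',
              loss=['categorical_crossentropy', 'mse'],
              metrics=['accuracy'], loss_weights=[1., 1.])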
Based on your explanation, I think you need a model that first predicts a token based on the previous tokens (in NLP this is usually called a language model) and then computes a score, which I assume is a sentiment score (the approach is applicable to other domains as well).
To do so, you can train your language model with an LSTM and use the last output of the LSTM for your ranking task. To this end, you need to define two loss functions: categorical_crossentropy for the language model and MSE for the ranking task.
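The arithmetic behind combining the two losses can be sketched in plain Python. This is only an illustration of the weighted sum that loss_weights controls, not Keras's internal implementation; the function names, the three-word vocabulary, and the predicted values are made up.

```python
import math


def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


def mse(y_true, y_pred):
    """Squared error for a single scalar score."""
    return (y_true - y_pred) ** 2


def combined_loss(lm_targets, lm_preds, score_true, score_pred,
                  loss_weights=(1.0, 1.0)):
    """Weighted sum of the language-model loss and the score loss."""
    lm_loss = sum(
        categorical_crossentropy(t, p) for t, p in zip(lm_targets, lm_preds)
    ) / len(lm_targets)
    score_loss = mse(score_true, score_pred)
    return loss_weights[0] * lm_loss + loss_weights[1] * score_loss


# Three next-word predictions over a toy 3-word vocabulary, plus one score.
lm_targets = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
lm_preds = [[0.25, 0.5, 0.25]] * 3
total = combined_loss(lm_targets, lm_preds, score_true=0.9, score_pred=0.7)
print(total)
```

Changing loss_weights shifts how strongly each objective influences the shared LSTM weights, for example (0.5, 1.0) to make the score prediction count for more than the language-model term.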
This tutorial would be helpful:
