I have a dataset of utterances and corresponding sentiment labels. I want to use an embedding of the sentiment label as an additional input to BERT (to simplify things, you can say that I want to initialize the embeddings for some tokens in my BERT model). There are 6-7 unique labels. I planned to use static embeddings like GloVe to map each label to an embedding, but this will not be compatible with BERT, which expects input embeddings of size 768. How can I generate static embeddings for my labels?
You can try SBERT to generate embeddings of a given dimension for both your sentences and your labels.
Here is the library - https://www.sbert.net/
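For example, a minimal sketch with the sentence-transformers package (the model name here is just one 768-dimensional option, not a requirement):

from sentence_transformers import SentenceTransformer

# "all-mpnet-base-v2" produces 768-dimensional embeddings, matching BERT's hidden size
model = SentenceTransformer("all-mpnet-base-v2")
label_embeddings = model.encode(["positive", "negative", "neutral"])  # shape: (3, 768)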
I have successfully implemented a word2vec model to generate word embeddings (word vectors). Now I need to generate sentence vectors from the generated word vectors so that I can feed a neural network to summarize a text corpus. What are the common approaches to generating sentence vectors from word vectors?
You can try adding an LSTM/RNN encoder before your actual neural network and feeding your neural net with the hidden states of the encoder (which would act as document representations).
The benefit of doing this is that your document embeddings will be trained for your specific task of text summarization.
I don't know which framework you are using, otherwise I would have helped you with some code to get you started.
EDIT 1: Add code snippet
from tensorflow.keras.layers import Input, Embedding, LSTM
# example sizes (illustrative placeholders); replace with values for your data
MAX_SEQ_LEN, VOCAB_SIZE, EMB_DIM, RNN_SIZE = 100, 20000, 300, 256
word_in = Input(shape=(MAX_SEQ_LEN,))
emb_word = Embedding(input_dim=VOCAB_SIZE, output_dim=EMB_DIM,
                     input_length=MAX_SEQ_LEN, mask_zero=True)(word_in)
lstm = LSTM(units=RNN_SIZE, return_sequences=False,
            recurrent_dropout=0.5, name="lstm_1")(emb_word)
Add any type of dense layer on top that takes these vectors as input.
The LSTM takes input of shape batch_size * sequence_length * word_vector_dimension and produces output of shape batch_size * rnn_size, which you can use as document embeddings.
Sentence representations can simply be the column-wise mean of all the word vectors in your sentence. There are also implementations of this idea, such as doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html), where a document is just a collection of words, like a sentence or a paragraph.
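A minimal sketch of the averaging approach, assuming a trained gensim Word2Vec model named w2v and pre-tokenized input:

import numpy as np

def sentence_vector(tokens, w2v):
    # average the vectors of the in-vocabulary tokens; fall back to zeros if none are known
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)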
So I want to ask: which is the best tool to prepare my text for deep learning?
What is the difference between Word2Vec, GloVe, Keras, LSA, ...?
You should use a pre-trained embedding to represent the sentence as a vector or a matrix. There are a lot of sources where you can find pre-trained embeddings trained on different datasets (for instance, all of Wikipedia). These models can have different dimensionalities, but normally each word is represented with 100 or 300 dimensions.
Pre-trained embeddings
Pre-trained embeddings 2
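For instance, a minimal sketch of loading one such pre-trained embedding through gensim's downloader (the dataset name is just one example):

import gensim.downloader as api

# "glove-wiki-gigaword-100" holds 100-dimensional GloVe vectors
glove = api.load("glove-wiki-gigaword-100")
vector = glove["text"]                     # one word -> 100-dimensional vector
print(glove.most_similar("text", topn=3))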
I am able to generate similarity between two sentences by loading spaCy's core_lg model trained on GloVe vectors.
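For context, a rough sketch of that similarity setup (assuming the en_core_web_lg model):

import spacy

nlp = spacy.load("en_core_web_lg")
doc1 = nlp("I want my money back")
doc2 = nlp("please refund my order")
print(doc1.similarity(doc2))  # cosine similarity of the averaged word vectors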
Now, I want to update the model with some domain-specific sentences and assign the same vector to both sentences so that they are treated as the same.
So, how do I add vectors like these on top of the model I am using?
Is there any other way to approach this problem?
I want to build a seq2seq chatbot with a pre-trained embedding matrix. Do pre-trained embedding matrices, for example GoogleNews-vectors-negative300, FastText and GloVe, have specific word vectors for <EOS> and <UNK>?
A pre-trained embedding has a specific, fixed vocabulary. The words which are not in that vocabulary are called OOV (out-of-vocabulary) words. The pre-trained embedding matrix will not provide any embedding for <UNK>. There are various methods to deal with UNK words:
Ignore the UNK words.
Use a random vector.
Use FastText as the pre-trained model, because it solves the OOV problem by constructing a vector for an UNK word from the n-gram vectors that constitute the word.
If the number of UNK words is low, accuracy won't be affected much. If the number is higher, it is better to train your own embedding or to use FastText.
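A minimal sketch of the FastText option with gensim (corpus is an assumed iterable of tokenized sentences):

from gensim.models import FastText

ft = FastText(sentences=corpus, vector_size=300, window=5, min_count=1)
vec = ft.wv["unseenword"]  # works even for words never seen in training, via character n-grams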
"EOS" Token can also be taken (initialized) as a random vector.
Make sure that the two random vectors are not the same.
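A rough sketch of giving <UNK> and <EOS> their own random rows in an embedding matrix (word_index and pretrained are assumed placeholders for your token-to-id map and your pre-trained lookup):

import numpy as np

rng = np.random.default_rng(42)
emb_dim = 300
embedding_matrix = np.zeros((len(word_index), emb_dim))
for word, idx in word_index.items():
    if word in pretrained:
        embedding_matrix[idx] = pretrained[word]
# distinct random vectors for the two special tokens
embedding_matrix[word_index["<UNK>"]] = rng.normal(scale=0.1, size=emb_dim)
embedding_matrix[word_index["<EOS>"]] = rng.normal(scale=0.1, size=emb_dim)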
I do sequence classification with Keras, using an RNN and embeddings. My sequences are a bit weird. I have words mixed with special symbols. Words are associated with fixed, pre-trained embeddings, but the special symbol embeddings have to be modified during training.
During learning, how can I keep some embeddings in the Embedding layer fixed while updating the others? Is there a way to mask the indices which shouldn't be modified? Or is this a case for a custom Embedding layer?
I do not believe this is achievable with the existing Embedding layer. To get around it, I would create a custom layer that builds two embedding layers internally and only exposes the embedding matrix of one of them as a trainable weight.
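A rough sketch of that two-embedding idea in tf.keras (the class name and the id convention, where special-symbol ids follow the word ids, are illustrative assumptions, not from the answer):

import tensorflow as tf

class PartiallyTrainableEmbedding(tf.keras.layers.Layer):
    """Frozen lookup for word ids, trainable lookup for special-symbol ids."""
    def __init__(self, word_weights, n_special, dim, **kwargs):
        super().__init__(**kwargs)
        self.n_words = word_weights.shape[0]
        self.word_emb = tf.keras.layers.Embedding(
            self.n_words, dim,
            embeddings_initializer=tf.keras.initializers.Constant(word_weights),
            trainable=False)
        self.special_emb = tf.keras.layers.Embedding(n_special, dim, trainable=True)

    def call(self, ids):
        # assumed convention: ids < n_words are words, ids >= n_words are special symbols
        is_special = ids >= self.n_words
        word_vecs = self.word_emb(tf.minimum(ids, self.n_words - 1))
        special_vecs = self.special_emb(tf.maximum(ids - self.n_words, 0))
        return tf.where(is_special[..., None], special_vecs, word_vecs)

During training only the special-symbol matrix receives gradient updates; the pre-trained word matrix stays fixed because its layer is marked non-trainable.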