Training SVM classifier (word embeddings vs. sentence embeddings)

I want to experiment with different embeddings such as Word2Vec, ELMo, and BERT, but I'm a little confused about whether to use word embeddings or sentence embeddings, and why. I'm using the embeddings as input features to an SVM classifier.
Thank you.

Though both approaches can work well depending on the dataset, as a rule of thumb I would advise you to use word embeddings when your input is only a few words, and sentence embeddings when your input is longer (e.g. large paragraphs).
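As an illustration of the word-embedding option, here is a minimal sketch (assuming gensim and scikit-learn are available; the helper sentence_vector and the toy data are my own, hypothetical names) that averages pre-trained word vectors into a fixed-length feature for an SVM. For longer inputs you would swap the averaging step for a sentence encoder.

import numpy as np
import gensim.downloader as api
from sklearn.svm import SVC

# Any pre-trained KeyedVectors works; GloVe 100-d is used here only for illustration.
word_vectors = api.load("glove-wiki-gigaword-100")

def sentence_vector(text, dim=100):
    """Average the vectors of in-vocabulary tokens; return zeros if none are found."""
    tokens = [t for t in text.lower().split() if t in word_vectors]
    if not tokens:
        return np.zeros(dim)
    return np.mean([word_vectors[t] for t in tokens], axis=0)

texts = ["great movie", "terrible plot and bad acting"]   # toy examples
labels = [1, 0]
X = np.vstack([sentence_vector(t) for t in texts])
clf = SVC(kernel="rbf").fit(X, labels)
# For longer inputs, a sentence encoder (e.g. SBERT) could replace the averaging step.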

Related

Is it OK to combine domain specific word2vec embeddings and off the shelf ELMo embeddings for a downstream unsupervised task?

I am wondering if I am using word embeddings correctly.
I have combined contextualised word vectors with static word vectors because:
my domain corpus is too small to effectively train the model from scratch
my domain is too specialised to use general embeddings.
I used the off-the-shelf ELMo small model and trained a word2vec model on a small domain-specific corpus (around 500 academic papers). I then did a simple concatenation of the vectors from the two different embeddings (sketched after this question).
I loosely followed the approach in this paper:
https://www.aclweb.org/anthology/P19-2041.pdf
But the approach in the paper trains the embeddings for a specific task. In my domain there is no labeled training data, hence I just trained the embeddings on the corpus alone.
I am new to NLP, so apologies if I am asking a stupid question.
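For reference, the concatenation step described above can be as simple as stacking the two per-token arrays along the feature axis. The arrays below are random stand-ins, not real ELMo or word2vec output.

import numpy as np

# Hypothetical per-token arrays you would obtain from the two embedders elsewhere.
elmo_vecs = np.random.rand(12, 256)   # e.g. 12 tokens from the ELMo small model
w2v_vecs = np.random.rand(12, 100)    # the same 12 tokens from the domain word2vec model

# Tokens must be aligned one-to-one before concatenating along the feature axis.
combined = np.concatenate([elmo_vecs, w2v_vecs], axis=1)   # shape (12, 356)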

What is the significance of the magnitude/norm of BERT word embeddings?

We generally compare similarity between word embeddings with cosine similarity, but this only takes into account the angle between the vectors, not the norm. With word2vec, the norm of the vector decreases as the word is used in more varied contexts. So stopwords have norms close to 0, while very distinctive, meaning-heavy words tend to have large vectors. BERT is context sensitive, so this explanation doesn't entirely cover BERT embeddings. Does anyone have any idea what the significance of vector magnitude could be with BERT?
I don't think there is any difference in relation to cosine similarity or the norm of a vector between BERT and other embeddings like GloVe or Word2Vec. It's just that BERT is a context-dependent embedding, so it provides different embeddings of a word for different contexts.
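If you want to inspect this empirically, here is a small sketch (assuming the Hugging Face transformers package and the bert-base-uncased checkpoint; the sentences are my own toy examples) that prints the norm of the same surface word in two different contexts.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for sent in ["the bank of the river", "i deposited cash at the bank"]:
    inputs = tok(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    # Locate the token "bank" in this sentence and print its vector norm.
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    print(sent, "-> norm of 'bank':", hidden[idx].norm().item())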

Best tool for text representation for deep learning

So I want to ask: which is the best tool to prepare my text for deep learning?
What is the difference between Word2Vec, GloVe, Keras, LSA...?
You should use a pre-trained embedding to represent the sentence as a vector or a matrix. There are a lot of sources where you can find pre-trained embeddings that use different datasets (for instance all of Wikipedia) to train their models. These models can have different lengths, but normally each word is represented with 100 or 300 dimensions.
Pre-trained embeddings
Pre-trained embeddings 2
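As a concrete starting point, here is a sketch (assuming gensim's downloader and 100-dimensional GloVe vectors; the sentence is a toy example) of turning a sentence into a token-by-dimension matrix that a deep learning model can consume.

import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")

sentence = "deep learning needs numeric input"
matrix = np.vstack([
    glove[w] if w in glove else np.zeros(100)   # zeros for out-of-vocabulary words
    for w in sentence.lower().split()
])
print(matrix.shape)   # (5, 100): one 100-d row per token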

Can we use a char RNN to create embeddings for out-of-vocabulary words?

I have word embeddings for 10 million words which were trained on a huge corpus. Now I want to produce word embeddings for out-of-vocabulary words. Can I design some char RNN to use these word embeddings and generate embeddings for out-of-vocab words? Or is there any other way I can get embeddings for OOV words?
FastText is capable of producing word embeddings for OOV words, but it does not have a distributed way of training or a GPU implementation, so in my case it could take almost 3 months to finish training. Any suggestions on this?
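One possible direction, sketched below under the assumption that PyTorch is available, is a MIMICK-style character-level model: train a char GRU to regress onto your existing pre-trained vectors, then run it on unseen words to synthesize OOV embeddings. The names CharToVec, encode, and known_embeddings are hypothetical, and the lookup here is a one-entry stand-in for the real 10M-word table.

import torch
import torch.nn as nn

EMB_DIM = 300          # dimensionality of the existing word embeddings
CHAR_VOCAB = 128       # simple ASCII character inventory for illustration

class CharToVec(nn.Module):
    def __init__(self, char_dim=64, hidden=256):
        super().__init__()
        self.char_emb = nn.Embedding(CHAR_VOCAB, char_dim)
        self.rnn = nn.GRU(char_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, EMB_DIM)

    def forward(self, char_ids):                  # char_ids: (batch, word_len)
        _, h = self.rnn(self.char_emb(char_ids))  # h: (1, batch, hidden)
        return self.out(h.squeeze(0))             # (batch, EMB_DIM)

def encode(word):
    # Map characters to integer ids, capped at the ASCII inventory size.
    return torch.tensor([[min(ord(c), CHAR_VOCAB - 1) for c in word]])

model = CharToVec()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Training loop sketch: regress the char-level prediction onto known vectors.
known_embeddings = {"example": torch.randn(EMB_DIM)}   # stand-in for the real lookup
for word, target in known_embeddings.items():
    pred = model(encode(word)).squeeze(0)
    loss = loss_fn(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, any unseen word gets a synthesized vector:
oov_vector = model(encode("unseenword")).detach().squeeze(0)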

Does a pre-trained embedding matrix have <EOS>, <UNK> word vectors?

I want to build a seq2seq chatbot with a pre-trained embedding matrix. Do pre-trained embedding matrices, for example GoogleNews-vectors-negative300, FastText, and GloVe, have specific word vectors for <EOS> and <UNK>?
A pre-trained embedding has a specific vocabulary defined. Words that are not in the vocabulary are called OOV (out-of-vocabulary) words. The pre-trained embedding matrix will not provide any embedding for <UNK>. There are various methods to deal with UNK words:
Ignore the UNK word
Use some random vector
Use FastText as the pre-trained model, because it solves the OOV problem by constructing a vector for the UNK word from the n-gram vectors that constitute the word.
If the number of UNK words is low, the accuracy won't be affected much. If the number is higher, it is better to train your own embedding or use FastText.
"EOS" Token can also be taken (initialized) as a random vector.
Make sure the both random vectors are not the same.
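A minimal sketch of the random-vector approach, assuming NumPy and a GloVe-style matrix already loaded (the vectors and word2idx below are small stand-ins for the real pre-trained data): append two distinct random rows for <UNK> and <EOS>.

import numpy as np

dim = 300
vectors = np.random.rand(5, dim)                 # stand-in for the real pre-trained matrix
word2idx = {"hello": 0, "world": 1}              # stand-in for the real vocabulary

rng = np.random.default_rng(0)
unk_vec = rng.normal(scale=0.1, size=(1, dim))   # two *different* random vectors
eos_vec = rng.normal(scale=0.1, size=(1, dim))

word2idx["<UNK>"] = vectors.shape[0]
word2idx["<EOS>"] = vectors.shape[0] + 1
vectors = np.vstack([vectors, unk_vec, eos_vec])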

Resources