Gensim most_similar method coefficients are very low

I trained a word2vec word-embedding model using gensim and then used the most_similar method to find the most associated words.
Word to search: forest
The result is below:
Most similar words: [('wood', 0.2495424747467041), ('trees', 0.24147865176200867), ('distant', 0.2403097301721573), ('island', 0.2402323037)]
I wonder why the similarity coefficients are so low; even the top word scores less than 0.25.
Thank you!
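For reference, a minimal sketch of the kind of call that produces such output (toy corpus and gensim 4.x parameter names assumed; older versions use size instead of vector_size):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences; real training data would be far larger.
sentences = [['the', 'forest', 'was', 'dark'],
             ['a', 'wood', 'full', 'of', 'trees'],
             ['the', 'distant', 'island', 'had', 'a', 'forest']]
model = Word2Vec(sentences=sentences, vector_size=50, window=5, min_count=1, workers=1)

# Query the trained vectors for the words most similar to 'forest'.
print(model.wv.most_similar('forest', topn=4))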

Related

Training SVM classifier (word embeddings vs. sentence embeddings)

I want to experiment with different embeddings such as Word2Vec, ELMo, and BERT, but I'm a little confused about whether to use word embeddings or sentence embeddings, and why. I'm using the embeddings as input features to an SVM classifier.
Thank you.
Though both approaches can prove efficient for different datasets, as a rule of thumb I would advise you to use word embeddings when your input is only a few words, and sentence embeddings when your input is longer (e.g. large paragraphs).
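For instance, one common way to feed word embeddings to an SVM is to average them per input text. A minimal sketch with toy 2-dimensional vectors standing in for real Word2Vec/GloVe embeddings (scikit-learn assumed):

import numpy as np
from sklearn.svm import SVC

# Toy word vectors; in practice these would come from Word2Vec, GloVe, etc.
word_vectors = {'good': np.array([0.9, 0.1]), 'great': np.array([0.8, 0.2]),
                'bad': np.array([0.1, 0.9]), 'awful': np.array([0.2, 0.8])}
dim = 2

def doc_features(text):
    # Average the vectors of the tokens that are in the vocabulary.
    vecs = [word_vectors[t] for t in text.split() if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

texts = ['good great', 'bad awful', 'great good good', 'awful bad bad']
labels = [1, 0, 1, 0]
X = np.vstack([doc_features(t) for t in texts])

clf = SVC().fit(X, labels)
print(clf.predict([doc_features('good but a bit awful')]))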

What is the significance of the magnitude/norm of BERT word embeddings?

We generally compare similarity between word embeddings with cosine similarity, but this only takes into account the angle between the vectors, not the norm. With word2vec, the norm of a vector decreases as the word is used in more varied contexts, so stopwords end up close to 0 while very distinctive, meaning-rich words tend to have large vectors. BERT is context sensitive, so this explanation doesn't entirely cover BERT embeddings. Does anyone have any idea what the significance of vector magnitude could be with BERT?
I don't think there is any difference, in relation to cosine similarity or the norm of a vector, between BERT and other embeddings like GloVe or Word2Vec. It's just that BERT is a context-dependent embedding, so it provides different embeddings of a word for different contexts.
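A small numeric illustration of the point about cosine similarity ignoring the norm (the vectors are arbitrary):

import numpy as np

def cosine(a, b):
    # Cosine similarity depends only on the angle between the vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 0.5, 1.0])

print(cosine(v, w))                               # some value in [-1, 1]
print(cosine(3 * v, w))                           # identical: rescaling does not change the angle
print(np.linalg.norm(v), np.linalg.norm(3 * v))   # the norms differ, carrying extra information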

Using cosine similarity for classifying documents

I have a set of files for five different categories, and most of them are not labelled correctly. The objective is to predict the correct category of a file whenever it is uploaded. I used cosine similarity along with TF-IDF and predict the class of the document with which the cosine similarity is the maximum. As of now I am getting good results, but I am really not sure how well this will work down the road. Also, why isn't cosine similarity used in building document classifiers instead of machine learning models when the categories of files are labelled correctly? I would really appreciate your feedback on my approach as well as your answer to the question.
Cosine similarity measures the angle between two n-dimensional vectors. These vectors are mostly produced by embeddings, i.e. pretrained models that map words to fixed-size vectors.
Cosine similarity is therefore mostly used with vectors produced by word embeddings. If you are using something like Doc2Vec, then you get a vector for the whole document, and these document vectors can be categorized using cosine similarity.
In your case, you should try an LSTM text classifier using embedding layers. 1D convolution layers can also be useful.
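A minimal sketch of such an LSTM classifier in Keras (vocabulary size, embedding dimension, and the five output classes are placeholder values):

from tensorflow.keras import layers, models

# Inputs are integer-encoded token sequences; the output is one of 5 document classes.
model = models.Sequential([
    layers.Input(shape=(None,)),
    layers.Embedding(input_dim=10000, output_dim=64),  # can be initialized from pretrained vectors
    layers.LSTM(32),
    layers.Dense(5, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()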
Also, regarding TF-IDF: it is useful for text classification that depends on certain words in the corpus. Words with a higher term frequency and a lower document frequency get a higher TF-IDF score, and the model learns to classify texts based on such scores.
In most cases, RNNs work best for classifying texts, and using pretrained embeddings makes the model efficient.
Last but not least, you can give naive Bayes text classification a try. It has been very useful in spam classification.
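A minimal naive Bayes sketch in that spirit (toy data, scikit-learn assumed):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ['win a free prize now', 'cheap pills online now',
         'meeting at noon tomorrow', 'project status update attached']
labels = ['spam', 'spam', 'ham', 'ham']

# Bag-of-words counts feeding a multinomial naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(['free pills prize']))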
Tip:
You can combine the above methods with each other, creating a text classification system, following a process like this:
1. Generate embeddings with Doc2Vec.
2. Compare the similarity of the input with the other texts and thereby determine its class (see the sketch after this list).
3. Feed the embedding to an LSTM network to produce class probabilities.
4. Apply Bayes text classification.
Steps 2, 3, and 4 give three predictions. If the majority prediction is CLASS1, then the system outputs CLASS1.
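A minimal sketch of the similarity-based step (and of the TF-IDF + cosine-similarity approach from the question), with placeholder training documents:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_texts = ['invoice for your recent purchase',
               'team meeting agenda for monday',
               'your package has been shipped']
train_labels = ['finance', 'internal', 'logistics']

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)

def predict(text):
    # Assign the label of the most cosine-similar training document.
    sims = cosine_similarity(vectorizer.transform([text]), train_matrix)[0]
    return train_labels[int(np.argmax(sims))]

print(predict('please find the attached invoice for the purchase'))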

How to get sentence embeddings from gensim Word2Vec embedding vectors?

I have a pandas dataframe containing descriptions. I would like to cluster the descriptions based on meaning using CBOW. My challenge for now is to embed each row as a document vector of equal dimension. At first I am training the word vectors using gensim like so:
import pandas as pd
from gensim.models import Word2Vec

# df is an existing DataFrame with 'description' and 'more_description' columns
vocab = pd.concat((df['description'], df['more_description']))
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
I am however a bit confused now about how to replace the full sentences in my df with document vectors of equal dimensions.
For now, my workaround is replacing each word in each row with a vector and then applying PCA dimensionality reduction to bring each vector to similar dimensions. Is there a better way of doing this through gensim, so that I could say something like this:
df['description'].apply(model.vectorize)
I think you are looking for sentence embeddings. There are a lot of ways of generating sentence embeddings from word embeddings. You may find this useful: https://stats.stackexchange.com/questions/286579/how-to-train-sentence-paragraph-document-embeddings
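For example, a common baseline is to simply average the word vectors of each description, which gives every row a vector of the same dimension. A minimal sketch, assuming a trained gensim model like the one above and whitespace-tokenized text (the doc_vector column name is made up):

import numpy as np

def average_vector(tokens, model):
    # Average the Word2Vec vectors of the tokens that are in the vocabulary.
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

df['doc_vector'] = df['description'].apply(lambda text: average_vector(text.split(), model))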

Does a pre-trained embedding matrix have <EOS>, <UNK> word vectors?

I want to build a seq2seq chatbot with a pre-trained embedding matrix. Do pre-trained embedding matrices, for example GoogleNews-vectors-negative300, FastText, and GloVe, have specific word vectors for <EOS> and <UNK>?
A pre-trained embedding has a specific vocabulary defined. Words which are not in that vocabulary are called OOV (out-of-vocabulary) words, and the pre-trained embedding matrix will not provide any embedding for <UNK>. There are various methods to deal with UNK words:
Ignore the UNK word
Use some random vector
Use FastText as the pre-trained model, because it addresses the OOV problem by constructing a vector for the unknown word from the n-gram vectors that constitute it.
If the number of UNK words is low, accuracy won't be affected a lot. If the number is higher, it is better to train your own embeddings or use FastText.
The <EOS> token can also be initialized as a random vector.
Make sure the two random vectors are not the same.
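A minimal sketch of extending a pre-trained matrix with two distinct random rows for <UNK> and <EOS> (the file name is a placeholder; index_to_key is the gensim 4.x attribute name, older versions use index2word):

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

rng = np.random.default_rng(0)
unk_vec = rng.normal(scale=0.1, size=kv.vector_size)  # random vector for <UNK>
eos_vec = rng.normal(scale=0.1, size=kv.vector_size)  # a different random vector for <EOS>

vocab = ['<UNK>', '<EOS>'] + list(kv.index_to_key)
embedding_matrix = np.vstack([unk_vec, eos_vec, kv.vectors])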
