Use custom Word2Vec embedding instead of GloVe - keras

I am working on text generation with a seq2seq model that uses GloVe embeddings. I want to use a custom Word2Vec (CBOW, trained with Gensim) embedding in this code instead. Can anyone please help me use my custom embedding in place of GloVe?
```
def initialize_embeddings(self):
    """Reads the GloVe word-embeddings and creates embedding matrix and word to index and index to word mapping."""
    # load the word embeddings
    self.word2vec = {}
    with open(glove_path%self.EMBEDDING_DIM, 'r') as file:
        for line in file:
            vectors = line.split()
            self.word2vec[vectors[0]] = np.asarray(vectors[1:], dtype="float32")

    # get the embeddings matrix
    self.num_words = min(self.MAX_VOCAB_SIZE, len(self.word2idx)+1)
    self.embeddings_matrix = np.zeros((self.num_words, self.EMBEDDING_DIM))
    for word, idx in self.word2idx.items():
        # strict "<": row index num_words would be out of bounds
        if idx < self.num_words:
            word_embeddings = self.word2vec.get(word)
            if word_embeddings is not None:
                self.embeddings_matrix[idx] = word_embeddings
    self.idx2word = {v:k for k,v in self.word2idx.items()}
```
This code reads the GloVe embeddings into a word-to-vector dictionary (self.word2vec). I want to load my own Word2Vec embedding instead.

word2vec and GloVe are both techniques for producing word embeddings, i.e., for turning text (a set of sentences) into computer-readable vectors.
While word2vec trains on the local context (neighboring words), GloVe looks at word co-occurrence across a whole text or corpus, so its approach is more global.
word2vec
There are two main approaches for word2vec, in both of which the algorithm loops through the words of the sentence. For each current word w it will try to predict either
the neighboring words from w (this is the Skip-Gram approach), or
w from its context (this is the CBOW approach).
Hence, word2vec will produce a similar embedding for words with similar contexts, for instance a noun in singular and its plural, or two synonyms.
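To make the two approaches concrete, here is a minimal sketch of the training examples each one would extract from a sentence. The function names and the window size of 1 are illustrative assumptions, not part of any library, and a real implementation adds sampling and a learned model on top of these pairs.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-Gram: predict each neighboring word from the current word w."""
    pairs = []
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((w, tokens[j]))  # (input word, word to predict)
    return pairs


def cbow_pairs(tokens, window=1):
    """CBOW: predict the current word w from its surrounding context."""
    pairs = []
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, w))  # (context words, word to predict)
    return pairs


sentence = ["the", "cat", "sat"]
sg = skipgram_pairs(sentence)
cb = cbow_pairs(sentence)
```

Both functions see the same windows; they differ only in which side is the input and which is the prediction target.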
GloVe
The main intuition underlying the GloVe model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. In other words, the embeddings are based on the computation of distances between pairs of target words. The model computes the distance between two target words in a text by analyzing the co-occurrence of those two target words with some other probe words (contextual words).
https://nlp.stanford.edu/projects/glove/
For example, consider the co-occurrence probabilities for target words "ice" and "steam" with various probe words from the vocabulary. Here are some actual probabilities from a 6 billion word corpus (the table itself is on the project page linked above):
As one might expect, "ice" co-occurs more frequently with "solid" than it does with "gas", whereas "steam" co-occurs more frequently with "gas" than it does with "solid". Both words co-occur with their shared property "water" frequently, and both co-occur with the unrelated word "fashion" infrequently. Only in the ratio of probabilities does noise from non-discriminative words like "water" and "fashion" cancel out, so that large values (much greater than 1) correlate well with properties specific to "ice", and small values (much less than 1) correlate well with properties specific of "steam". In this way, the ratio of probabilities encodes some crude form of meaning associated with the abstract concept of thermodynamic phase.
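The ratio argument can be illustrated with a toy calculation. The counts below are invented purely for illustration (they are not the figures from the GloVe project page), and the variable and function names are hypothetical.

```python
# Hypothetical co-occurrence counts: counts[target][probe]
counts = {
    "ice":   {"solid": 80, "gas": 5,  "water": 300, "fashion": 2},
    "steam": {"solid": 4,  "gas": 90, "water": 280, "fashion": 2},
}
# Hypothetical total number of context words seen around each target
totals = {"ice": 10000, "steam": 10000}


def cooccurrence_prob(target, probe):
    """P(probe | target): share of the target's context made up by the probe word."""
    return counts[target][probe] / totals[target]


# Ratio P(k | ice) / P(k | steam) for every probe word k
ratios = {
    probe: cooccurrence_prob("ice", probe) / cooccurrence_prob("steam", probe)
    for probe in counts["ice"]
}
```

With these toy numbers the ratio is far above 1 for "solid", far below 1 for "gas", and close to 1 for "water" and "fashion", which is the pattern the passage describes.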
GloVe is also very good at word analogies, and performs well on the word2vec analogy dataset.
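Coming back to the question itself, here is a minimal sketch of building the embedding matrix from custom Word2Vec vectors rather than from the GloVe text file. It assumes the vectors are available as a mapping from word to vector; a Gensim KeyedVectors object supports the same `word in vectors` and `vectors[word]` lookups, so it should be usable in place of the dict. The function name `build_embeddings_matrix`, the file name in the comment, and the toy values are hypothetical; a plain dict stands in for the trained model so the snippet runs on its own.

```python
import numpy as np


def build_embeddings_matrix(word2idx, vectors, max_vocab_size, embedding_dim):
    """Put each word's vector in the row given by its index; unknown words stay zero."""
    num_words = min(max_vocab_size, len(word2idx) + 1)
    matrix = np.zeros((num_words, embedding_dim), dtype="float32")
    for word, idx in word2idx.items():
        if idx < num_words and word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Stand-in for a trained model. With Gensim this would be, for example:
#   from gensim.models import Word2Vec
#   vectors = Word2Vec.load("my_word2vec.model").wv   # hypothetical file name
toy_vectors = {
    "hello": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "world": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
word2idx = {"hello": 1, "world": 2, "unseen": 3}  # index 0 is reserved for padding

embeddings_matrix = build_embeddings_matrix(
    word2idx, toy_vectors, max_vocab_size=20000, embedding_dim=3
)
```

The resulting matrix can be used the same way as the GloVe matrix in the question. Note that embedding_dim has to match the dimensionality the Word2Vec model was trained with (exposed as `vector_size` on Gensim's KeyedVectors).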

Related

Text classification using Word2Vec and Pos tag

I have a medical dataset like
Text: "weakness, diarrhea, neck pain" Target:"X.1, Y.1" which is coded diagnosis
I am also using pre-trained Word2Vec and POS tagging.
For example, the word "weakness" has a word vector like
[0.2 0.04 ........ 0.05] (300 dim)
and its POS tag is "Symptom, Noun".
My question is: how do I combine POS tags and word embeddings to train a model with Keras?
There are multiple ways to deal with that.
You can build an ensemble model, i.e., train two different models separately, one on POS tags and one on word2vec. If each model gives a prediction value at its final layer (or some interpretation of probability), you can take the average for your final prediction.
You can combine word2vec with POS tags as the input to a single neural network.
However, I strongly believe POS tags will not be a good idea in this case. As you can see, all these words may have similar POS tags (most are isolated words and nouns), so the tags will have much less entropy.
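The option of combining word2vec with POS tags in one network can be sketched as appending a one-hot POS vector to each word vector before it is fed to the model. The tag set, the helper name `word_with_pos`, and the 4-dimensional toy embedding are assumptions for illustration; real inputs would use the 300-dimensional pre-trained vectors and the tag set of your tagger.

```python
import numpy as np

POS_TAGS = ["NOUN", "VERB", "ADJ"]  # hypothetical tag set


def word_with_pos(word_vector, pos_tag):
    """Append a one-hot encoding of the POS tag to the word embedding."""
    one_hot = np.zeros(len(POS_TAGS), dtype="float32")
    one_hot[POS_TAGS.index(pos_tag)] = 1.0
    return np.concatenate([word_vector, one_hot])


weakness = np.array([0.2, 0.04, 0.1, 0.05], dtype="float32")  # toy embedding
features = word_with_pos(weakness, "NOUN")
```

Each token then has embedding_dim + len(POS_TAGS) features, so the input layer of the network has to be sized accordingly.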

What are some common ways to get sentence vector from corresponding word vectors?

I have successfully implemented a word2vec model to generate word embeddings (word vectors). Now I need to generate sentence vectors from these word vectors so that I can feed a neural network to summarize a text corpus. What are the common approaches to generate sentence vectors from word vectors?
You can try adding an LSTM/RNN encoder before your actual neural network and feeding your network the hidden states of the encoder (which would act as document representations).
The benefit of doing this is that your document embeddings will be trained for your specific task of text summarization.
I don't know which framework you are using; otherwise I would have helped you with some code to get you started.
EDIT 1: Add code snippet
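from tensorflow.keras.layers import Input, Embedding, LSTM

# Placeholder sizes; set these for your own data.
MAX_SEQ_LEN, VOCAB_SIZE, EMB_DIM, RNN_SIZE = 50, 10000, 100, 128
word_in = Input(shape=(MAX_SEQ_LEN,))
emb_word = Embedding(input_dim=VOCAB_SIZE, output_dim=EMB_DIM, mask_zero=True)(word_in)
# return_sequences=False keeps only the final hidden state, one vector per document.
lstm = LSTM(units=RNN_SIZE, return_sequences=False, recurrent_dropout=0.5, name="lstm_1")(emb_word)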
Then add any type of dense layer that takes vectors as input.
The LSTM takes input of shape batch_size * sequence_length * word_vector_dimension and produces output of shape batch_size * rnn_size, which you can use as document embeddings.
Sentence representations can simply be the column-wise mean of all the word vectors in your sentence. There are also dedicated approaches such as doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html), which learns a vector for a whole document, where a document is just a collection of words such as a sentence or paragraph.
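A minimal sketch of the column-wise mean, assuming the word vectors are available as a word-to-vector mapping (a plain dict here, standing in for a trained model). The helper name `sentence_vector` and the toy 2-dimensional vectors are hypothetical.

```python
import numpy as np


def sentence_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that have an embedding; zeros if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype="float32")
    return np.mean(known, axis=0)


toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}
# "a" has no vector in the toy mapping, so it is skipped
vec = sentence_vector(["a", "good", "movie"], toy_vectors, dim=2)
```

Every sentence ends up with a vector of the same dimension as the word vectors, regardless of its length.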

How to train a model that will result in the similarity score between two news titles?

I am trying to build a fake news classifier and I am quite new to this field. I have a column "title_1_en" which has the title of a fake news item and another column called "title_2_en". There are 3 target labels, "agreed", "disagreed", and "unrelated", depending on whether the title in column "title_2_en" agrees with, disagrees with, or is unrelated to the one in the first column.
I have tried calculating basic cosine similarity between the two titles after converting the words of the sentences into vectors. This gives a cosine similarity score, but it needs a lot of improvement because synonyms and semantic relationships are not considered at all.
import numpy as np

def L2(vector):
    norm_value = np.linalg.norm(vector)
    return norm_value

def Cosine(fr1, fr2):
    cos = np.dot(fr1, fr2)/(L2(fr1)*L2(fr2))
    return cos
The most important thing here is how you convert the two sentences into vectors. There are multiple ways to do that, and the most naive way is:
1. Convert each and every word into a vector. This can be done using standard pre-trained vectors such as word2vec or GloVe.
2. Now every sentence is just a bag of word vectors. This needs to be converted into a single vector, i.e., the full sentence text is mapped to one vector. There are many ways to do this too. For a start, just take the average of the bag of vectors in the sentence.
3. Compute cosine similarity between the two sentence vectors.
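The three steps above can be sketched as follows. The toy 2-dimensional vectors are made up so that "film" and "movie" lie close together, as pre-trained vectors would tend to place synonyms; the function names are hypothetical, and lowercasing plus whitespace splitting is only a stand-in for real tokenization.

```python
import numpy as np


def average_vector(tokens, vectors):
    """Steps 1 and 2: look up each word and average the vectors that were found."""
    found = [vectors[t] for t in tokens if t in vectors]
    return np.mean(found, axis=0) if found else None


def title_similarity(title_a, title_b, vectors):
    """Step 3: cosine similarity of the two averaged vectors (0.0 if undefined)."""
    a = average_vector(title_a.lower().split(), vectors)
    b = average_vector(title_b.lower().split(), vectors)
    if a is None or b is None:
        return 0.0
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


toy_vectors = {
    "film": np.array([0.9, 0.1]),
    "movie": np.array([0.8, 0.2]),   # close to "film"
    "stocks": np.array([0.1, 0.9]),
}
same_topic = title_similarity("New film", "New movie", toy_vectors)
other_topic = title_similarity("New film", "Stocks", toy_vectors)
```

Because the comparison happens in vector space, two titles that share no words can still score high when their words have similar embeddings. The score could then serve as one input feature for the three-label classifier.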
spaCy's similarity is a good place to start; it uses the averaging technique. From the docs:
By default, spaCy uses an average-of-vectors algorithm, using
pre-trained vectors if available (e.g. the en_core_web_lg model). If
not, the doc.tensor attribute is used, which is produced by the
tagger, parser and entity recognizer. This is how the en_core_web_sm
model provides similarities. Usually the .tensor-based similarities
will be more structural, while the word vector similarities will be
more topical. You can also customize the .similarity() method, to
provide your own similarity function, which can be trained using
supervised techniques.

How to sentence embed from gensim Word2Vec embedding vectors?

I have a pandas dataframe containing descriptions. I would like to cluster the descriptions based on meaning using CBOW. My challenge for now is to embed each row as a document vector of equal dimensions. At first I am training the word vectors using gensim like so:
from gensim.models import Word2Vec
vocab = pd.concat((df['description'], df['more_description']))
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
I am, however, a bit confused about how to replace the full sentences in my df with document vectors of equal dimensions.
For now, my workaround is replacing each word in each row with a vector and then applying PCA dimensionality reduction to bring each row to the same dimensions. Is there a better way of doing this through gensim, so that I could say something like this:
df['description'].apply(model.vectorize)
I think you are looking for sentence embeddings. There are a lot of ways of generating sentence embeddings from word embeddings. You may find this useful: https://stats.stackexchange.com/questions/286579/how-to-train-sentence-paragraph-document-embeddings

Does a pre-trained embedding matrix have <EOS> and <UNK> word vectors?

I want to build a seq2seq chatbot with a pre-trained embedding matrix. Do pre-trained embedding matrices, for example GoogleNews-vectors-negative300, FastText and GloVe, have specific word vectors for <EOS> and <UNK>?
A pre-trained embedding comes with a fixed vocabulary. Words that are not in that vocabulary are called OOV (out-of-vocabulary) words. The pre-trained embedding matrix will typically not provide an embedding for UNK. There are various methods to deal with UNK words:
Ignore the UNK word
Use some random vector
Use fastText as the pre-trained model, because it handles the OOV problem by constructing a vector for the unknown word from the vectors of the character n-grams that make up the word.
If the number of UNK words is low, accuracy won't be affected much. If the number is higher, it is better to train your own embedding or use fastText.
The "EOS" token can also be initialized as a random vector.
Make sure the two random vectors are not the same.
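A minimal sketch of the random-vector option, assuming the pre-trained vectors have already been loaded into a NumPy matrix. The toy matrix, the token spellings "<UNK>" and "<EOS>", the sampling range, and the `lookup` helper are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed fixed only to make the sketch reproducible
embedding_dim = 4

# Toy stand-in for a pre-trained matrix with two known words.
vocab = ["hello", "world"]
pretrained = rng.normal(size=(len(vocab), embedding_dim)).astype("float32")

# One independent random draw per special token, so the two vectors differ.
special_tokens = ["<UNK>", "<EOS>"]
special_vectors = rng.uniform(
    -0.25, 0.25, size=(len(special_tokens), embedding_dim)
).astype("float32")

# Append the special tokens after the pre-trained vocabulary.
vocab = vocab + special_tokens
embedding_matrix = np.vstack([pretrained, special_vectors])
word2idx = {word: i for i, word in enumerate(vocab)}


def lookup(word):
    """Return the word's vector, falling back to the <UNK> vector for OOV words."""
    return embedding_matrix[word2idx.get(word, word2idx["<UNK>"])]
```

If the embedding layer is left trainable, these two rows can be adjusted during training while starting from distinct values.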

Resources