Using glove.6B.100d.txt embedding in spacy getting zero lex.rank - nlp

I am trying to load the GloVe 100d embeddings into a spaCy NLP pipeline.
I created the vocabulary in spaCy format as follows:
python -m spacy init-model en spacy.glove.model --vectors-loc glove.6B.100d.txt
glove.6B.100d.txt was converted to word2vec format by adding "400000 100" as the first line.
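A minimal sketch of that conversion (the header line is "<vocab_size> <dimensions>"; the output file name is illustrative):
# Count the vocabulary size, then write a word2vec-style header followed by the GloVe rows.
num_lines = sum(1 for _ in open('glove.6B.100d.txt', encoding='utf-8'))
with open('glove.6B.100d.txt', encoding='utf-8') as src, \
        open('glove.6B.100d.w2v.txt', 'w', encoding='utf-8') as dst:
    dst.write(f'{num_lines} 100\n')
    for line in src:
        dst.write(line)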
Now spacy.glove.model/vocab has the following files:
5468549 key2row
38430528 lexemes.bin
5485216 strings.json
160000128 vectors
In the code:
import spacy
nlp = spacy.load("en_core_web_md")
from spacy.vocab import Vocab
vocab = Vocab().from_disk('./spacy.glove.model/vocab')
nlp.vocab = vocab
print(len(nlp.vocab.strings))
print(nlp.vocab.vectors.shape)
gives
407174
(400000, 100)
However the problem is that:
V=nlp.vocab
max_rank = max(lex.rank for lex in V if lex.has_vector)
print(max_rank)
gives 0
I just want to use the 100d glove embeddings within spacy in combination with "tagger", "parser", "ner" models from en_core_web_md.
Does anyone know how to go about doing this correctly (is this possible)?

The tagger/parser/ner models are trained with the included word vectors as features, so if you replace them with different vectors you are going to break all those components.
You can use new vectors to train a new model, but replacing the vectors in a model with trained components is not going to work well. The tagger/parser/ner components will most likely provide nonsense results.
If you want 100d vectors instead of 300d vectors to save space, you can resize the vectors, which will truncate each entry to its first 100 dimensions. The performance will go down a bit as a result.
import spacy
nlp = spacy.load("en_core_web_md")
assert nlp.vocab.vectors.shape == (20000, 300)
nlp.vocab.vectors.resize((20000, 100))
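If you go this route, you would presumably save the resized pipeline so the smaller vectors are reused on later loads (the path below is just illustrative):
nlp.to_disk("./en_core_web_md_100d")
reloaded = spacy.load("./en_core_web_md_100d")
assert reloaded.vocab.vectors.shape == (20000, 100)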

Related

How to convert the text into vector using word2vec embedding?

Suppose I have a dataframe shown below:
| Text                                  |
| Storm in RI worse than last hurricane |
| Green Line derailment in Chicago      |
| MEG issues Hazardous Weather Outlook  |
I created a word2vec model using the code below:
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
text_data = list(sent_to_words(df['Text']))
w2v_model = gensim.models.Word2Vec(text_data, size=100, min_count=1, window=5, iter=50)
Now how will I convert the text in the 'Text' column to vectors using this word2vec model?
You can get the generated word embeddings with
w2v_model.wv
and you can get the embedding of a specific word with
w2v_model.wv['word']
Word2Vec models can only map words to vectors, so, as @metalrt mentioned, you have to use a function over the set of word vectors to convert them to a single sentence vector. A good baseline is to compute the mean of the word vectors:
import numpy as np
df["Text"].apply(lambda text: np.mean([w2v_model.wv[word] for word in text.split() if word in w2v_model.wv]))
The example above implements very simple tokenization by whitespace characters. You can also use the spaCy library to implement better tokenization:
import spacy
nlp = spacy.load("en_core_web_sm")
df["Text"].apply(lambda text: np.mean([self.keyed_vectors[token.text] for token in nlp.pipe(text) if not token.is_punct and token.text in self.keyed_vectors]))

Cannot reproduce pre-trained word vectors from its vector_ngrams

Just out of curiosity: I was debugging gensim's FastText code to replicate its handling of Out-of-Vocabulary (OOV) words, and I haven't been able to reproduce it.
The process I'm following is training a tiny model on a toy corpus and then comparing the resulting vectors of a word in the vocabulary. That means if the whole process is OK, the output arrays should be the same.
Here is the code I've used for the test:
from gensim.models import FastText
import numpy as np
# Default gensim's function for hashing ngrams
from gensim.models._utils_any2vec import ft_hash_bytes
# Toy corpus
sentences = [['hello', 'test', 'hello', 'greeting'],
             ['hey', 'hello', 'another', 'test']]
# Instantiate gensim's FastText class
ft = FastText(sg=1, size=5, min_count=1,
              window=2, hs=0, negative=20,
              seed=0, workers=1, bucket=100,
              min_n=3, max_n=4)
# Build vocab
ft.build_vocab(sentences)
# Fit model weights (vectors_ngram)
ft.train(sentences=sentences, total_examples=ft.corpus_count, epochs=5)
# Save model
ft.save('./ft.model')
del ft
# Load model
ft = FastText.load('./ft.model')
# Generate ngrams for test-word given min_n=3 and max_n=4
encoded_ngrams = [b"<he", b"<hel", b"hel", b"hell", b"ell", b"ello", b"llo", b"llo>", b"lo>"]
# Hash ngrams to its corresponding index, just as Gensim does
ngram_hashes = [ft_hash_bytes(n) % 100 for n in encoded_ngrams]
word_vec = np.zeros(5, dtype=np.float32)
for nh in ngram_hashes:
    word_vec += ft.wv.vectors_ngrams[nh]
# Compare both arrays
print(np.isclose(ft.wv['hello'], word_vec))
The output of this script is False for every dimension of the compared arrays.
It would be nice if someone could point out whether I'm missing something or doing something wrong. Thanks in advance!
The calculation of a full word's FastText word-vector is not just the sum of its character n-gram vectors; it also includes a raw full-word vector that is trained for in-vocabulary words.
The full-word vectors you get back from ft.wv[word] for known-words have already had this combination pre-calculated. See the adjust_vectors() method for an example of this full calculation:
https://github.com/RaRe-Technologies/gensim/blob/68ec5b8ed7f18e75e0b13689f4da53405ef3ed96/gensim/models/keyedvectors.py#L2282
The raw full-word vectors are in a .vectors_vocab array on the model.wv object.
(If this isn't enough to reconcile matters: ensure you're using the latest gensim, as there have been many recent FT fixes. And, ensure your list of ngram-hashes matches the output of the ft_ngram_hashes() method of the library – if not, your manual ngram-list-creation and subsequent hashing may be doing something different.)
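For illustration, a rough sketch of that reconciliation (assuming gensim 3.x, where adjust_vectors() averages the raw full-word vector together with the ngram vectors):
# Combine the raw full-word vector with its ngram vectors, then average,
# mirroring what adjust_vectors() does for in-vocabulary words.
word_index = ft.wv.vocab['hello'].index
combined = np.copy(ft.wv.vectors_vocab[word_index])
for nh in ngram_hashes:
    combined += ft.wv.vectors_ngrams[nh]
combined /= (len(ngram_hashes) + 1)
print(np.isclose(ft.wv['hello'], combined))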

How to turn a list of words into a list of vectors using a pre-trained word2vec model (Google)?

I am trying to learn word2vec.
I am using the code below to load the Google pre-trained word2vec model in Python 3, but I am unsure how to turn a list such as ["I", "ate", "apple"] into a list of vectors (i.e. how to get vectors from this model?).
import nltk
import gensim
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)
You get the vector via idiomatic Python keyed-index-access (brackets). For example:
wv_apple = model['apple']
You can create a new list based on some operation on every item of an existing list via an idiomatic Python 'list comprehension' ([expression(x) for x in some_list]). For example:
words = ["I", "ate", "apple"]
vectors = [model[word] for word in words]
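If any of the words might be missing from the model's vocabulary, a guarded variant avoids a KeyError:
vectors = [model[word] for word in words if word in model]   # skip out-of-vocabulary words
missing = [word for word in words if word not in model]      # keep track of what was skipped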

How to find the score for sentence Similarity using Word2Vec

I am new to NLP. How do I find the similarity between 2 sentences, and how do I print a score for each word? Also, how do I implement the gensim Word2Vec model?
Try this code; here are my two sentences:
sentence1="I am going to India"
sentence2=" I am going to Bharat"
from gensim.models import word2vec
import numpy as np
words1 = sentence1.split(' ')
words2 = sentence2.split(' ')
#The meaning of the sentence can be interpreted as the average of its words
sentence1_meaning = word2vec(words1[0])
count = 1
for w in words1[1:]:
    sentence1_meaning = np.add(sentence1_meaning, word2vec(w))
    count += 1
sentence1_meaning /= count
sentence2_meaning = word2vec(words2[0])
count = 1
for w in words2[1:]:
    sentence2_meaning = np.add(sentence2_meaning, word2vec(w))
    count += 1
sentence2_meaning /= count
#Similarity is the cosine between the vectors
similarity = np.dot(sentence1_meaning, sentence2_meaning)/(np.linalg.norm(sentence1_meaning)*np.linalg.norm(sentence2_meaning))
You can train the model and use the similarity function to get the cosine similarity between two words.
Here's a simple demo:
from gensim.models import Word2Vec
from gensim.test.utils import common_texts
model = Word2Vec(common_texts,
                 size=500,
                 window=5,
                 min_count=1,
                 workers=4)
word_vectors = model.wv
word_vectors.similarity('computer', 'computer')
The output will be 1.0, of course, which indicates 100% similarity.
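For whole sentences, one simple baseline (a sketch using word lists drawn from common_texts) is n_similarity, which compares the mean vectors of two word lists:
sentence1 = ['human', 'interface', 'computer']
sentence2 = ['survey', 'user', 'computer', 'system']
print(word_vectors.n_similarity(sentence1, sentence2))  # cosine similarity of the two averaged sentence vectors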
After your from gensim.models import word2vec, word2vec is a Python module – not a function that you can call as word2vec(words1[0]) or word2vec(w).
So your code isn't even close to approaching this correctly, and you should review docs/tutorials which demonstrate the proper use of the gensim Word2Vec class & supporting methods, then mimic those.
As @david-dale mentions, there's a basic intro in the gensim docs for Word2Vec:
https://radimrehurek.com/gensim/models/word2vec.html
The gensim library also bundles within its docs/notebooks directory a number of Jupyter notebooks demonstrating various algorithms & techniques. The notebook word2vec.ipynb shows basic Word2Vec usage; you can also view it via the project's source code repository at...
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
...however, it's really best to run it as a local notebook, so you can step through the execution cell-by-cell and try different variants yourself, perhaps even adapting it to use your data instead.
When you reach that level, note that:
these models require far more than just a few sentences as training data, so ideally you'd either have (a) many sentences from the same domain as those you're comparing, so that the model can learn words in those contexts, or (b) a model trained on a compatible corpus, which you then apply to your out-of-corpus sentences.
using the average of all the word-vectors in a sentence is just one relatively simple way to make a vector for a longer text; there are many other more sophisticated ways. One alternative very similar to Word2Vec is the 'Paragraph Vector' algorithm, also available in gensim as the class Doc2Vec; a minimal sketch follows.
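A rough Doc2Vec sketch on toy data (attribute names vary a bit across gensim versions; a real model needs a much larger corpus):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
docs = [TaggedDocument(words=['i', 'am', 'going', 'to', 'india'], tags=[0]),
        TaggedDocument(words=['i', 'am', 'going', 'to', 'bharat'], tags=[1])]
d2v = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
print(d2v.docvecs.similarity(0, 1))  # cosine similarity between the two document vectors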

Improving on the basic, existing GloVe model

I am using GloVe as part of my research. I've downloaded the models from here. I've been using GloVe for sentence classification. The sentences I'm classifying are specific to a particular domain, say some STEM subject. However, since the existing GloVe models are trained on a general corpus, they may not yield the best results for my particular task.
So my question is, how would I go about loading the pretrained model and retraining it a little more on my own corpus, so that it also learns the semantics of my corpus? There would be merit in doing this if it were possible.
After a little digging, I found this issue on the git repo. Someone suggested the following:
Yeah, this is not going to work well due to the optimization setup. But what you can do is train GloVe vectors on your own corpus and then concatenate those with the pretrained GloVe vectors for use in your end application.
So that answers that.
I believe GloVe (Global Vectors) is not meant to be appended to, since it is based on overall word co-occurrence statistics from a single corpus that is known only at initial training time.
What you can do is use the gensim.scripts.glove2word2vec API to convert GloVe vectors into word2vec format, but I don't think you can continue training, since it loads into a KeyedVectors object, not a full model.
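For what it's worth, a minimal sketch of that conversion (using gensim's bundled script; file names are illustrative):
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
# Prepends the "<vocab_size> <dims>" header to produce word2vec text format
glove2word2vec('glove.6B.100d.txt', 'glove.6B.100d.w2v.txt')
kv = KeyedVectors.load_word2vec_format('glove.6B.100d.w2v.txt')
print(kv.most_similar('computer', topn=3))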
The Mittens library (installable via pip) does that, provided your corpus/vocab is not too huge or your RAM is big enough to handle the entire co-occurrence matrix.
It takes 3 steps -
import csv
import pickle
import numpy as np
from collections import Counter
from nltk.corpus import brown
from mittens import GloVe, Mittens
# Note: in newer scikit-learn versions the stop-word list lives at
# sklearn.feature_extraction.text.ENGLISH_STOP_WORDS instead.
from sklearn.feature_extraction import stop_words
from sklearn.feature_extraction.text import CountVectorizer
1- Load the pretrained model - Mittens needs the pretrained model loaded as a dictionary. Get the pretrained model from https://nlp.stanford.edu/projects/glove
with open("glove.6B.100d.txt", encoding='utf-8') as f:
reader = csv.reader(f, delimiter=' ',quoting=csv.QUOTE_NONE)
embed = {line[0]: np.array(list(map(float, line[1:])))
for line in reader}
Data pre-processing
sw = list(stop_words.ENGLISH_STOP_WORDS)
brown_data = brown.words()[:200000]
brown_nonstop = [token.lower() for token in brown_data if token.lower() not in sw]
new_vocab = [token for token in brown_nonstop if token not in pre_glove.keys()]
The Brown corpus is used as a sample dataset here, and new_vocab represents the vocabulary not present in the pretrained GloVe vectors. The co-occurrence matrix built from new_vocab requires O(n^2) space once densified, so you can optionally filter out rare new_vocab words to save space.
new_vocab_rare = [k for (k, v) in Counter(new_vocab).items() if v <= 1]
corp_vocab = list(set(new_vocab) - set(new_vocab_rare))
Remove those rare words and prepare the dataset:
brown_tokens = [token for token in brown_nonstop if token not in new_vocab_rare]
brown_doc = [' '.join(brown_tokens)]
(If you would rather not filter out rare words, just use corp_vocab = list(set(new_vocab)) instead.)
2- Building the co-occurrence matrix:
sklearn's CountVectorizer transforms the document into a word-document matrix.
The matrix multiplication X.T * X then gives the word-word co-occurrence matrix.
cv = CountVectorizer(ngram_range=(1,1), vocabulary=corp_vocab)
X = cv.fit_transform(brown_doc)
Xc = (X.T * X)
Xc.setdiag(0)
coocc_ar = Xc.toarray()
3- Fine-tuning the Mittens model - Instantiate the model and run the fit function.
# n must match the dimensionality of the pretrained vectors (100 for glove.6B.100d)
mittens_model = Mittens(n=100, max_iter=1000)
new_embeddings = mittens_model.fit(
    coocc_ar,
    vocab=corp_vocab,
    initial_embedding_dict=pre_glove)
Save the model as pickle for future use.
newglove = dict(zip(corp_vocab, new_embeddings))
f = open("repo_glove.pkl","wb")
pickle.dump(newglove, f)
f.close()
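A quick usage check (assuming the pickle written above) for reloading the fine-tuned vectors later:
with open("repo_glove.pkl", "rb") as f:
    newglove = pickle.load(f)
print(len(newglove), len(next(iter(newglove.values()))))  # vocabulary size and vector dimensionality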
