Improving on the basic, existing GloVe model - nlp

I am using GloVe as part of my research. I've downloaded the models from here. I've been using GloVe for sentence classification. The sentences I'm classifying are specific to a particular domain, say some STEM subject. However, since the existing GloVe models are trained on a general corpus, they may not yield the best results for my particular task.
So my question is, how would I go about loading the retrained model and just retraining it a little more on my own corpus to learn the semantics of my corpus as well? There would be merit in doing this were it possible.

After a little digging, I found this issue on the git repo. Someone suggested the following:
Yeah, this is not going to work well due to the optimization setup. But what you can do is train GloVe vectors on your own corpus and then concatenate those with the pretrained GloVe vectors for use in your end application.
So that answers that.

I believe GloVe (Global Vectors) is not meant to be appended, since it is based on the corpus' overall word co-occurrence statistics from a single corpus known only at initial training time
You can do is use gensim.scripts.glove2word2vec api to convert GloVe vectors into word2vec, but i dont think you can continue training since its loading in a KeyedVector not a Full Model

Mittens library (installable via pip) does that if your corpus/vocab is not too huge or your RAM is big enough to handle the entire co-occurrence matrix.
3 steps-
import csv
import numpy as np
from collections import Counter
from nltk.corpus import brown
from mittens import GloVe, Mittens
from sklearn.feature_extraction import stop_words
from sklearn.feature_extraction.text import CountVectorizer
1- Load pretrained model - Mittens needs a pretrained model to be loaded as a dictionary. Get the pretrained model from https://nlp.stanford.edu/projects/glove
with open("glove.6B.100d.txt", encoding='utf-8') as f:
reader = csv.reader(f, delimiter=' ',quoting=csv.QUOTE_NONE)
embed = {line[0]: np.array(list(map(float, line[1:])))
for line in reader}
Data pre-processing
sw = list(stop_words.ENGLISH_STOP_WORDS)
brown_data = brown.words()[:200000]
brown_nonstop = [token.lower() for token in brown_data if (token.lower() not in sw)]
oov = [token for token in brown_nonstop if token not in pre_glove.keys()]
Using brown corpus as a sample dataset here and new_vocab represents the vocabulary not present in pretrained glove. The co-occurrence matrix is built from new_vocab. It is a sparse matrix, requiring a space complexity of O(n^2). You can optionally filter out rare new_vocab words to save space
new_vocab_rare = [k for (k,v) in Counter(new_vocab).items() if v<=1]
corp_vocab = list(set(new_vocab) - set(new_vocab_rare))
remove those rare words and prepare dataset
brown_tokens = [token for token in brown_nonstop if token not in new_vocab_rare]
brown_doc = [' '.join(brown_tokens)]
corp_vocab = list(set(new_vocab))
2- Building co-occurrence matrix:
sklearn’s CountVectorizer transforms the document into word-doc matrix.
The matrix multiplication Xt*X gives the word-word co-occurrence matrix.
cv = CountVectorizer(ngram_range=(1,1), vocabulary=corp_vocab)
X = cv.fit_transform(brown_doc)
Xc = (X.T * X)
Xc.setdiag(0)
coocc_ar = Xc.toarray()
3- Fine-tuning the mittens model - Instantiate the model and run the fit function.
mittens_model = Mittens(n=50, max_iter=1000)
new_embeddings = mittens_model.fit(
coocc_ar,
vocab=corp_vocab,
initial_embedding_dict= pre_glove)
Save the model as pickle for future use.
newglove = dict(zip(corp_vocab, new_embeddings))
f = open("repo_glove.pkl","wb")
pickle.dump(newglove, f)
f.close()

Related

How to generate sentence embedding using long-former model

I am using Hugging Face mrm8488/longformer-base-4096-finetuned-squadv2 pre-trained model
https://huggingface.co/mrm8488/longformer-base-4096-finetuned-squadv2.
I want to generate sentence level embedding. I have a data-frame which has a text column.
I am using this code:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this." # I will pas text-column here from my data-frame
#question = "What has Huggingface done ?"
encoding = tokenizer(question, text, return_tensors="pt")
# I don't want to use it for Question-Answer use-case. I just need the sentence embeddings
input_ids = encoding["input_ids"]
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
attention_mask = encoding["attention_mask"]
How can I do modification in the above code to generate embedding for sentences. ?
I have the following examples:
Text
i've added notes to the claim and it's been escalated for final review
after submitting the request you'll receive an email confirming the open request.
hello my name is person and i'll be assisting you
this is sam and i'll be assisting you for date.
I'll return the amount as asap.
ill return it to you.
The Longformer uses a local attention mechanism and you need to pass a global attention mask to let one token attend to all tokens of your sequence.
import torch
from transformers import LongformerTokenizer, LongformerModel
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = LongformerTokenizer.from_pretrained(ckpt)
model = LongformerModel.from_pretrained(ckpt)
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this." # I will pas text-column here from my data-frame
#question = "What has Huggingface done ?"
encoding = tokenizer(text, return_tensors="pt")
global_attention_mask = [1].extend([0]*encoding["input_ids"].shape[-1])
encoding["global_attention_mask"] = global_attention_mask
# I don't want to use it for Question-Answer use-case. I just need the sentence embeddings
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
o = model(**encoding)
sentence_embedding = o.last_hidden_state[:,0]
You should keep in mind that mrm8488/longformer-base-4096-finetuned-squadv2 was not pre-trained to produce meaningful sentence embeddings and faces the same issues as the MLM pre-trained BERT's regarding sentence embeddings.

Using glove.6B.100d.txt embedding in spacy getting zero lex.rank

I am trying to load glove 100d emebddings in spacy nlp pipeline.
I create the vocabulary in spacy format as follows:
python -m spacy init-model en spacy.glove.model --vectors-loc glove.6B.100d.txt
glove.6B.100d.txt is converted to word2vec format by adding "400000 100" in the first line.
Now
spacy.glove.model/vocab has following files:
5468549 key2row
38430528 lexemes.bin
5485216 strings.json
160000128 vectors
In the code:
import spacy
nlp = spacy.load("en_core_web_md")
from spacy.vocab import Vocab
vocab = Vocab().from_disk('./spacy.glove.model/vocab')
nlp.vocab = vocab
print(len(nlp.vocab.strings))
print(nlp.vocab.vectors.shape) gives
gives
407174
(400000, 100)
However the problem is that:
V=nlp.vocab
max_rank = max(lex.rank for lex in V if lex.has_vector)
print(max_rank)
gives 0
I just want to use the 100d glove embeddings within spacy in combination with "tagger", "parser", "ner" models from en_core_web_md.
Does anyone know how to go about doing this correctly (is this possible)?
The tagger/parser/ner models are trained with the included word vectors as features, so if you replace them with different vectors you are going to break all those components.
You can use new vectors to train a new model, but replacing the vectors in a model with trained components is not going to work well. The tagger/parser/ner components will most likely provide nonsense results.
If you want 100d vectors instead of 300d vectors to save space, you can resize the vectors, which will truncate each entry to first 100 dimensions. The performance will go down a bit as a result.
import spacy
nlp = spacy.load("en_core_web_md")
assert nlp.vocab.vectors.shape == (20000, 300)
nlp.vocab.vectors.resize((20000, 100))

Cannot reproduce pre-trained word vectors from its vector_ngrams

Just curiosity, but I was debugging gensim's FastText code for replicating the implementation of Out-of-Vocabulary (OOV) words, and I'm not being able to accomplish it.
So, the process i'm following is training a tiny model with a toy corpus, and then comparing the resulting vectors of a word in the vocabulary. That means if the whole process is OK, the output arrays should be the same.
Here is the code I've used for the test:
from gensim.models import FastText
import numpy as np
# Default gensim's function for hashing ngrams
from gensim.models._utils_any2vec import ft_hash_bytes
# Toy corpus
sentences = [['hello', 'test', 'hello', 'greeting'],
['hey', 'hello', 'another', 'test']]
# Instatiate FastText gensim's class
ft = FastText(sg=1, size=5, min_count=1, \
window=2, hs=0, negative=20, \
seed=0, workers=1, bucket=100, \
min_n=3, max_n=4)
# Build vocab
ft.build_vocab(sentences)
# Fit model weights (vectors_ngram)
ft.train(sentences=sentences, total_examples=ft.corpus_count, epochs=5)
# Save model
ft.save('./ft.model')
del ft
# Load model
ft = FastText.load('./ft.model')
# Generate ngrams for test-word given min_n=3 and max_n=4
encoded_ngrams = [b"<he", b"<hel", b"hel", b"hell", b"ell", b"ello", b"llo", b"llo>", b"lo>"]
# Hash ngrams to its corresponding index, just as Gensim does
ngram_hashes = [ft_hash_bytes(n) % 100 for n in encoded_ngrams]
word_vec = np.zeros(5, dtype=np.float32)
for nh in ngram_hashes:
word_vec += ft.wv.vectors_ngrams[nh]
# Compare both arrays
print(np.isclose(ft.wv['hello'], word_vec))
The output of this script is False for every dimension of the compared arrays.
It would be nice if someone could point me out if i'm missing something or doing something wrong. Thanks in advance!
The calculation of a full word's FastText word-vector is not just the sum of its character n-gram vectors, but also a raw full-word vector that's also trained for in-vocabulary words.
The full-word vectors you get back from ft.wv[word] for known-words have already had this combination pre-calculated. See the adjust_vectors() method for an example of this full calculation:
https://github.com/RaRe-Technologies/gensim/blob/68ec5b8ed7f18e75e0b13689f4da53405ef3ed96/gensim/models/keyedvectors.py#L2282
The raw full-word vectors are in a .vectors_vocab array on the model.wv object.
(If this isn't enough to reconcile matters: ensure you're using the latest gensim, as there have been many recent FT fixes. And, ensure your list of ngram-hashes matches the output of the ft_ngram_hashes() method of the library – if not, your manual ngram-list-creation and subsequent hashing may be doing something different.)

How to find the score for sentence Similarity using Word2Vec

I am new to NLP, how to find the similarity between 2 sentences and also how to print scores of each word. And also how to implement the gensim word2Vec model.
Try this code:
here my two sentences :
sentence1="I am going to India"
sentence2=" I am going to Bharat"
from gensim.models import word2vec
import numpy as np
words1 = sentence1.split(' ')
words2 = sentence2.split(' ')
#The meaning of the sentence can be interpreted as the average of its words
sentence1_meaning = word2vec(words1[0])
count = 1
for w in words1[1:]:
sentence1_meaning = np.add(sentence1_meaning, word2vec(w))
count += 1
sentence1_meaning /= count
sentence2_meaning = word2vec(words2[0])
count = 1
for w in words2[1:]:
sentence2_meaning = np.add(sentence2_meaning, word2vec(w))
count += 1
sentence2_meaning /= count
#Similarity is the cosine between the vectors
similarity = np.dot(sentence1_meaning, sentence2_meaning)/(np.linalg.norm(sentence1_meaning)*np.linalg.norm(sentence2_meaning))
You can train the model and use the similarity function to get the cosine similarity between two words.
Here's a simple demo:
from gensim.models import Word2Vec
from gensim.test.utils import common_texts
model = Word2Vec(common_texts,
size = 500,
window = 5,
min_count = 1,
workers = 4)
word_vectors = model.wv
word_vectors.similarity('computer', 'computer')
The output will be 1.0, of course, which indicates 100% similarity.
After your from gensim.models import word2vec, word2vec is a Python module – not a function that you can call as word2vec(words1[0]) or word2vec(w).
So your code isn't even close to approaching this correctly, and you should review docs/tutorials which demonstrate the proper use of the gensim Word2Vec class & supporting methods, then mimic those.
As #david-dale mentions, there's a basic intro in the gensim docs for Word2Vec:
https://radimrehurek.com/gensim/models/word2vec.html
The gensim library also bundles within its docs/notebooks directory a number of Jupyter notebooks demonstrating various algorithms & techniques. The notebook word2vec.ipynb shows basic Word2Vec usage; you can also view it via the project's source code repository at...
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
...however, it's really best to run as a local notebook, so you can step through the execution cell-by-cell, and try different variants yourself, perhaps even adapting it to use your data instead.
When you reach that level, note that:
these models require far more than just a few sentences as training - so ideally you'd either have (a) many sentences from the same domain as those you're comparing, so that the model can learn words in those contexts; (b) a model trained from a compatible corpus, which you then apply to your out-of-corpus sentences.
using the average of all the word-vectors in a sentence is just one relatively-simple way to make a vector for a longer text; there are many other more-sophisticated ways. One alternative very similar to Word2Vec is the 'Paragraph Vector' algorithm also available in gensim as the class Doc2Vec.

how to convert Word to vector using embedding layer in Keras

I am having a word embedding file as shown below click here to see the complete file in github.I would like to know the procedure for generating word embeddings So that i can generate word embedding for my personal dataset
in -0.051625 -0.063918 -0.132715 -0.122302 -0.265347
to 0.052796 0.076153 0.014475 0.096910 -0.045046
for 0.051237 -0.102637 0.049363 0.096058 -0.010658
of 0.073245 -0.061590 -0.079189 -0.095731 -0.026899
the -0.063727 -0.070157 -0.014622 -0.022271 -0.078383
on -0.035222 0.008236 -0.044824 0.075308 0.076621
and 0.038209 0.012271 0.063058 0.042883 -0.124830
a -0.060385 -0.018999 -0.034195 -0.086732 -0.025636
The 0.007047 -0.091152 -0.042944 -0.068369 -0.072737
after -0.015879 0.062852 0.015722 0.061325 -0.099242
as 0.009263 0.037517 0.028697 -0.010072 -0.013621
Google -0.028538 0.055254 -0.005006 -0.052552 -0.045671
New 0.002533 0.063183 0.070852 0.042174 0.077393
with 0.087201 -0.038249 -0.041059 0.086816 0.068579
at 0.082778 0.043505 -0.087001 0.044570 0.037580
over 0.022163 -0.033666 0.039190 0.053745 -0.035787
new 0.043216 0.015423 -0.062604 0.080569 -0.048067
I was able to convert each words in a dictionary to the above format by following the below steps:
initially represent each words in the dictionary by unique integer
take each integer one by one and perform array([[integer]]) and give it as input array in below code
then the word corresponding to integer and respective output vector can be stored to json file ( i used output_array.tolist() for storing the vector in json format)
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding
model = Sequential()
model.add(Embedding(dictionary_size_here, sizeof_embedding_vector, input_length= input_length_here))
input_array = array([[integer]]) #each integer is fed one by one using a loop
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
Reference
How does Keras 'Embedding' layer work?
It is important to understand that there are multiple ways to generate an embedding for words. The popular word2vec, for example, can generate word embeddings using CBOW or Skip-grams.
Hence, one could have multiple "procedures" to generate word embeddings. One of the easier to understand method (albeit with its drawbacks) to generate an embedding is using Singular Value Decomposition (SVD). The steps are briefly described below.
Create a Term-Document matrix. i.e. terms as rows and the document it appears in as columns.
Perform SVD
Truncate the output vector for the term to n dimension. In your example above, n = 5.
You can have a look at this link for a more detailed description using word2vec's skipgram model to generate an embedding. Word2Vec Tutorial - The Skip-Gram Model.
For more information on SVD, you can look at this and this.

Resources