Gensim train word2vec on wikipedia - preprocessing and parameters - nlp

I am trying to train the word2vec model from gensim using the Italian wikipedia
"http://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2"
However, I am not sure what is the best preprocessing for this corpus.
gensim model accepts a list of tokenized sentences.
My first try is to just use the standard WikiCorpus preprocessor from gensim. This extracts each article, removes punctuation and splits words on spaces. With this tool each "sentence" would correspond to an entire article, and I am not sure of the impact of this on the model.
After this I train the model with default parameters. Unfortunately after training it seems that I do not manage to obtain very meaningful similarities.
What is the most appropriate preprocessing of the Wikipedia corpus for this task? (If these questions are too broad, please help me by pointing to a relevant tutorial / article.)
This is the code of my first trial:
from gensim.corpora import WikiCorpus
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2', dictionary=False)
max_sentence = -1

def generate_lines():
    for index, text in enumerate(corpus.get_texts()):
        if index < max_sentence or max_sentence == -1:
            yield text
        else:
            break

from gensim.models.word2vec import Word2Vec

model = Word2Vec()
model.build_vocab(generate_lines())  # This strangely builds a vocab of "only" 747904 words, which is << the ~10M words reported in the literature
model.train(generate_lines(), chunksize=500)

Your approach is fine.
model.build_vocab(generate_lines())  # This strangely builds a vocab of "only" 747904 words, which is << the ~10M words reported in the literature
This could be because of pruning infrequent words (the default is min_count=5).
To speed up computation, you can consider "caching" the preprocessed articles as a plain .txt.gz file, one sentence (document) per line, and then simply using word2vec.LineSentence corpus. This saves parsing the bzipped wiki XML on every iteration.
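A minimal sketch of that caching step, assuming gensim's WikiCorpus/LineSentence as above (the output file name is a placeholder, and on very old gensim versions get_texts() may yield bytes rather than strings, in which case decode first):
import gzip
from gensim.corpora import WikiCorpus
from gensim.models.word2vec import LineSentence, Word2Vec

# One-off pass: dump the preprocessed articles to plain text, one article per line.
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2', dictionary=False)
with gzip.open('itwiki-texts.txt.gz', 'wt', encoding='utf-8') as out:  # placeholder file name
    for tokens in corpus.get_texts():
        out.write(' '.join(tokens) + '\n')

# Later runs: iterate over the cached file instead of re-parsing the bzipped XML dump.
sentences = LineSentence('itwiki-texts.txt.gz')
model = Word2Vec(sentences, min_count=5)  # min_count=5 is the default pruning threshold mentioned above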
Why word2vec doesn't produce "meaningful similarities" for Italian wiki, I don't know. English wiki seems to work fine. See also here.

I've been working on a project to massage the wikipedia corpus and get vectors out of it.
I might generate the Italian vectors soon but in case you want to do it on your own take a look at:
https://github.com/idio/wiki2vec

Related

What is nlp in spacy?

Usually we start from:
nlp = spacy.load('en_core_web_sm') # or medium, or large
or
nlp = English()
then:
doc = nlp('my text')
Then we can do a lot with that, even without knowing what the first line actually gives us.
But what exactly is 'nlp'? What is going on under the hood? Is "nlp" a pretrained model, as understood in machine learning, and therefore some big file located somewhere on disk?
I came across an explanation that 'nlp' is an 'object containing a processing pipeline', but that only explains a little.
You can always check the type of any Python object:
nlp = spacy.load('en_core_web_sm') # or medium, or large
print(type(nlp))
print(dir(nlp)) # view a list of attributes
You will get something like this (depending on the passed arguments)
<class 'spacy.lang.en.English'>
You are right, it is something like a 'pretrained' model, as it contains the vocabulary, binary weights, etc.
Please check the official documentation:
https://spacy.io/api/language
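A rough way to see what got loaded is to inspect the Language object directly; nlp.pipe_names, nlp.meta and nlp.path are standard attributes, though the exact output depends on the installed model version:
import spacy

nlp = spacy.load('en_core_web_sm')

print(type(nlp))       # <class 'spacy.lang.en.English'>
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
print(nlp.meta['name'], nlp.meta['version'])  # metadata of the installed model package
print(nlp.path)        # where the model data lives on disk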
You could infer what nlp() is by exploring it. For example:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_lg")
text = "Elon Musk 889-888-8888 elonpie#tessa.net Jeff Bezos (345)123-1234 bezzi#zonbi.com Reshma Saujani example.email#email.com 888-888-8888 Barkevious Mingo"
text = nlp(text)
print(text)
This will print the exact same text. On the other hand, if you do:
for word in text.ents:
    print(word.text, word.label_)
you will get the entities of the string:
Elon Musk PERSON
889-888 CARDINAL
Jeff Bezos PERSON
345)123 CARDINAL
Reshma Saujani PERSON
It is indeed a large pre-trained model for the English language and has many functions (parser, lemmatizer, tagger) like the one demonstrated above. Hope this helps a bit to clarify your question.
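For instance, here is a small self-contained sketch of the tagger, lemmatizer and parser output on a sentence of my own (the token attributes used are standard spaCy API):
import spacy

nlp = spacy.load("en_core_web_lg")             # same model as in the answer above
doc = nlp("Elon Musk founded SpaceX in 2002.")

# Per-token annotations from the tagger, lemmatizer and dependency parser
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)

# Named entities, as shown above
for ent in doc.ents:
    print(ent.text, ent.label_)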

How to convert small dataset into word embeddings instead of one-hot encoding?

I have a dataset of 33 words that are a mix of verbs and nouns, e.g. father, sing, etc. I have tried converting them to one-hot encoding, but for my use case it has been suggested to look into word2vec embeddings. I have looked at gensim and GloVe but am struggling to make them work.
How could I convert my data into embeddings, such that two words that are semantically closer have a smaller distance between their respective vectors? How can this be achieved, and is there any helpful material on this?
Since your dataset is quite small, and I'm assuming it doesn't contain any jargon, it's best to use a pre-trained model in order to save on training time.
With gensim, it's as simple as:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
The 'word2vec-google-news-300' model has been pre-trained on a part of the Google News Dataset and generalizes well enough to most tasks. Following this, you can create word embeddings/vectors like so:
vec = wv['father']
And, finally, for computing word similarity:
similarity_score = wv.similarity('father', 'sing')
Lastly, one major limitation of Word2Vec is its inability to deal with words that are OOV (out of vocabulary). For such cases, it's best to train a custom model for your corpus.
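As a hedged sketch of what that looks like for a small word list (the word list is a stand-in for your 33 words; the membership check assumes gensim 4.x, where `w in wv` works on KeyedVectors, while older versions use `w in wv.vocab`):
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

words = ['father', 'sing', 'mother', 'dance']   # stand-in for your 33 words
vectors = {}
for w in words:
    if w in wv:                                 # skip OOV words instead of crashing
        vectors[w] = wv[w]
    else:
        print(w, 'is out of vocabulary, skipping')

# Semantically related words end up with higher similarity (smaller distance)
print(wv.similarity('father', 'mother'))        # relatively high
print(wv.similarity('father', 'sing'))          # lower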

Is it necessary to do stopword removal, stemming/lemmatization for text classification while using spaCy, BERT?

Is stopword removal, stemming and lemmatization necessary for text classification when using spaCy, BERT or other advanced NLP models to get the vector embedding of the text?
text = "The food served in the wedding was very delicious"
1. Since spaCy and BERT were trained on huge raw datasets, is there any benefit to applying stopword removal, stemming and lemmatization to this text before generating the embedding with BERT/spaCy for a text classification task?
2. I can understand that stopword removal, stemming and lemmatization help when we use CountVectorizer or TfidfVectorizer to get embeddings of sentences.
You can test whether stemming, lemmatization and stopword removal help. They don't always. I usually do it if I am going to graph the results, as the stopwords clutter them up.
A case for keeping stopwords
Stopwords provide context about the user's intent, which matters when you use a contextual model like BERT. In such models all stopwords are kept to provide enough context information, including the negation words (not, nor, never), which are usually considered stopwords.
According to https://arxiv.org/pdf/1904.07531.pdf
"Surprisingly, the stopwords received as much attention as non-stop words, but removing them has no effect inMRR performances. "
With BERT you don't preprocess the texts; otherwise you lose the context (stemming, lemmatization) or change the texts outright (stop word removal).
Some more basic models (rule-based or bag-of-words) would benefit from some processing, but you must be very careful with stop words removal: many words that change the meaning of an entire sentence are stop words (not, no, never, unless).
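For a bag-of-words pipeline, a hedged sketch of that kind of careful preprocessing with spaCy (the negation whitelist is my own choice; is_stop, is_punct and lemma_ are standard token attributes):
import spacy

nlp = spacy.load("en_core_web_sm")

KEEP = {"not", "no", "never", "nor", "unless"}   # negation words we do not want to drop

def preprocess_for_bow(text):
    # Lemmatize and drop stopwords/punctuation, but keep negation words.
    doc = nlp(text)
    return [tok.lemma_.lower()
            for tok in doc
            if (not tok.is_stop or tok.lower_ in KEEP) and not tok.is_punct]

print(preprocess_for_bow("The food served in the wedding was not very delicious"))
# e.g. ['food', 'serve', 'wedding', 'not', 'delicious']
# For BERT, skip all of this and feed the raw text to its tokenizer instead.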
Do not remove stopwords when they add information (context awareness) to the task, e.g. text summarization, machine/language translation, language modeling, question answering.
Remove stopwords if we want only the general idea of the sentence, e.g. sentiment analysis, language/text classification, spam filtering, caption generation, auto-tag generation, topic/document classification.
It's not mandatory. Removing stopwords can sometimes help and sometimes not. You should try both.

word2vec limit similar_by_vector() result to re-trained corpus

Assume you have a (Wikipedia) pre-trained word2vec model, and train it on an additional corpus (very small, 1000 sentences).
Can you imagine a way to limit a vector-search to the "re-trained" corpus only?
For example
model.wv.similar_by_vector()
will simply find the closest word for a given vector, no matter if it is part of the Wikipedia corpus, or the re-trained vocabulary.
On the other hand, for a 'word' search the concept exists:
model.wv.most_similar_to_given('house', ['garden', 'boat'])
I have tried to train on the small corpus from scratch, and it somewhat works as expected. But of course it could be much more powerful if the assigned vectors came from a pre-trained set.
Sharing an efficient way to do this manually:
re-train word2vec on the additional corpus
create full unique word-index of corpus
fetch re-trained vectors for each word in the index
instead of the canned function "similar_by_vector", use scipy.spatial.KDTree.query()
This finds the closest word within the given corpus only and works as expected.
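A minimal sketch of those four steps (the toy corpus and variable names are mine; gensim 4.x naming with vector_size is assumed):
import numpy as np
from scipy.spatial import KDTree
from gensim.models import Word2Vec

# 1. (re-)trained model; here simply trained on the small corpus for illustration
small_corpus = [["the", "house", "has", "a", "garden"],
                ["the", "boat", "is", "on", "the", "lake"]]
model = Word2Vec(small_corpus, vector_size=50, min_count=1)

# 2. full unique word index of the small corpus
vocab = sorted({w for sent in small_corpus for w in sent})

# 3. fetch the (re-)trained vector for each word in the index
vectors = np.array([model.wv[w] for w in vocab])

# 4. nearest-neighbour search restricted to that vocabulary
tree = KDTree(vectors)
dist, idx = tree.query(model.wv["house"], k=3)   # 3 closest in-corpus words
print([vocab[i] for i in np.atleast_1d(idx)])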
Similar to the approach for creating a subset of doc-vectors in a new KeyedVectors instance suggested here, assuming small_vocab is a list of the words in your new corpus, you could try:
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

subset_vectors = WordEmbeddingsKeyedVectors(vector_size)
subset_vectors.add(small_vocab, w2v_model.wv[small_vocab])
Then subset_vectors contains just the words you've selected, but supports familiar operations like most_similar().
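In gensim 4.x the WordEmbeddings* classes were folded into KeyedVectors, so a roughly equivalent version (still assuming w2v_model, small_vocab and vector_size as above) would be:
from gensim.models import KeyedVectors

subset_vectors = KeyedVectors(vector_size)
subset_vectors.add_vectors(small_vocab, [w2v_model.wv[w] for w in small_vocab])

print(subset_vectors.most_similar(small_vocab[0], topn=3))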

Biasing word2vec towards special corpus

I am new to stackoverflow. Please forgive my bad English.
I am using word2vec for a school project. I want to work with a domain-specific corpus (like a physics textbook) for creating the word vectors using Word2Vec. On its own this does not give good results because the corpus is small. This especially hurts as we want to evaluate on words that may well be outside the vocabulary of the textbook.
We want the textbook to encode the domain-specific relationships and semantic "nearness": "Quantum" and "Heisenberg", for example, are especially close in this textbook, which may not hold true for a background corpus. To handle generic words (like "any") we need the basic background model (like the one provided by Google on the word2vec site).
Is there any way that we can supplement the background model with our newer corpus? Just training on the corpus alone does not work well.
Are there any attempts to combine vector representations from two corpora, general and specific? I could not find any in my searches.
Let's talk about gensim since you tagged your question with it. You can load a previously trained model in Python using gensim and then continue training it. Would that be useful?
# load from previous gensim file:
model = gensim.models.Word2Vec.load(fname)
# or from word2vec c format:
# model = gensim.models.Word2Vec.load_word2vec_format('/path/vectors.bin', binary=True)
# continue training:
model.train(other_sentences)
model.save(fname)
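With current gensim (4.x) the train call needs explicit counts and epochs, and only a full model saved via model.save() can be trained further (vectors loaded with load_word2vec_format are a KeyedVectors object and cannot be). A hedged sketch with placeholder file names and a toy domain corpus:
from gensim.models import Word2Vec

model = Word2Vec.load('background_model.model')           # placeholder file name

other_sentences = [["quantum", "mechanics", "heisenberg"],
                   ["the", "uncertainty", "principle"]]   # your domain-specific corpus

# Grow the vocabulary with the new domain-specific words, then keep training
model.build_vocab(other_sentences, update=True)
model.train(other_sentences,
            total_examples=len(other_sentences),
            epochs=model.epochs)

model.save('biased_model.model')                          # placeholder file name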
