Understanding the role of the function build_vocab in Doc2Vec - nlp

I have recently started studying Doc2Vec model.
I have understood its mechanism and how it works.
I'm trying to implement it using gensim framework.
I have transormed my training data into TaggedDocument.
But i have one question :
What is the role of this line model_dbow.build_vocab([x for x in tqdm(train_tagged.values)]) ?
is it to create random vectors that represent text ?
Thank you for your help

The Doc2Vec model needs to know several things about the training corpus before it is fully allocated & initialized.
First & foremost, the model needs to know the words present & their frequencies – a working vocabulary – so that it can determine the words that will remain after the min_count floor is applied, and allocate/initialize word-vectors & internal model structures for the relevant words. The word-frequencies will also be used to influence the random sampling of negative-word-examples (for the default negative-sampling mode) and the downsampling of very-frequent words (per the sample parameter).
Additionally, the model needs to know the rough size of the overall training set in order to gradually decrement the internal alpha learning-rate over the course of each epoch, and give meaningful progress-estimates in logging output.
At the end of build_vocab(), all memory/objects needed for the model have been created. Per the needs of the underlying algorithm, all vectors will have been initialized to low-magnitude random vectors to ready the model for training. (It essentially won't use any more memory, internally, through training.)
Also, after build_vocab(), the vocabulary is frozen: any words presented during training (or later inference) that aren't already in the model will be ignored.

Related

Having trouble training Word2Vec iteratively on Gensim

I'm attempting to train multiple texts supplied by myself iteratively. However, I keep running into an issue when I train the model more than once:
ValueError: You must specify either total_examples or total_words, for proper learning-rate and progress calculations. If you've just built the vocabulary using the same corpus, using the count cached in the model is sufficient: total_examples=model.corpus_count.
I'm currently initiating my model like this:
model = Word2Vec(sentences, min_count=0, workers=cpu_count())
model.build_vocab(sentences, update=False)
model.save('firstmodel.model')
model = Word2Vec.load('firstmodel.model')
and subsequently training it iteratively like this:
model.build_vocab(sentences, update = True)
model.train(sentences, totalexamples=model.corpus_count, epochs=model.epochs)
What am I missing here?
Somehow, it worked when I just trained one other model, so not sure why it doesn't work beyond two models...
First, the error message says you need to supply either the total_examples or total_words parameter to train() (so that it has an accurate estimate of the total training-corpus size).
Your code, as currently shown, only supplies totalexamples – a parameter name missing the necessary _. Correcting this typo should remedy the immediate error.
However, some other comments on your usage:
repeatedly calling train() with different data is an expert technique highly subject to error or other problems. It's not the usual way of using Word2Vec, nor the way most published results were reached. You can't count on it to always improve the model with new words; it might make the model worse, as new training sessions update some-but-not-all words, and alter the (usual) property that the vocabulary has one consistent set of word-frequencies from one single corpus. The best course is to train() once, with all available data, so that the full vocabulary, word-frequencies, & equally-trained word-vectors are achieved in a single consistent session.
min_count=0 is almost always a bad idea with word2vec: words with few examples in the corpus should be discarded. Trying to learn word-vectors for them not only gets weak vectors for those words, but dilutes/distracts the model from achieving better vectors for surrounding more-common words.
a count of workers up to your local cpu_count() only reliably helps up to about 4-12 workers, depending on other parameters & the efficiency of your corpus-reading, then more workers can hurt, due to inefficiencies in the Python GIL & Gensim corpus-to-worker handoffs. (inding the actual best count for your setup is, unfortunately, still just a matter of trial and error. But if you've got 16 (or more) cores, your setting is almost sure to do worse than a lower workers number.

Custom word-embeddings in gensim

I have a word embedding matrix (say M) obtained of order V x N where V is the size of the vocabulary and N is the size of each word vector. I want the word2vec model of gensim to initialise its word embedding matrix with M, during training. I am able to load M in the word2vec format using
gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)
but I don't know how to feed M into the gensim word2vec model.
The Gensim Word2Vec model isn't designed to be pre-initialized with outside vectors, so there's no built-in helper methods.
But of course since the source is available, and all objects can be modified by your direct tampering, you can still change/replace its usual (random) initialization with anything of your own choosing, wiht a little effort.
Specifically, you'd first create a Word2Vec model using the constructor without yet supplying any training corpus. (If you supply a corpus, it will automatically do the next two build_vocab() and train() steps for you, and you don't want that.)
Then, you'd perform the necessary .build_vocab() step, allowing it to survey your training text data to discover its vocabulary with word-frequencies, and perform its usual model initialization:
model.build_vocab(corpus)
At this point, before doing any other training, you can tamper with the model to replace its random-initialization of words with your alternate word-vectors, as from the vectors you've loaded. If your other vectors are in the KeyedVectors variable loaded_kv, this could be as simple as:
for word in loaded_kv.index_to_key:
if word in model.wv:
model.kv[word] = loaded_kv[word]
Note that if your loaded_kv includes words that aren't in the corpus, or too rare (appear fewer than min_count times) in the corpus, the model will not have allocated a space for those vectors – as they won't be used in training – and they won't be part of the final model.
If for some reason you need them to be, you should ensure a sufficient number of valid usage examples of those words appear inside the corpus. You shouldn't add them to the model in a ways that changes the total number of vectors in model.wv after .build_vocab(), because the model is not expecting that sort of change, and errors/undefined-behavior are likely.
(You also shouldn't simply force extra words that aren't in the real training data into the model, because while they will remain unchanged through training, all other words will continue to be adjusted through training – meaning any words that weren't also incrementally adjusted, in an interleaved fashion, along with the rest may wind up essentially "incompatible" with the word-vectors that were co-trained in the same full model.)
After you've modified the model's initialization to match your preferences, continue with normal training, something like:
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

Training Doc2vec with new data

I have a doc2vec model trained on documents with labels. I'm trying to continue training my model with model.train(). The new data comes with new labels as well, but, when I train it on more documents, the new labels aren't being recorded... Does anyone know what my problem might be?
Gensim's Doc2Vec only learns its set of tags at the same time it learns the corpus vocabulary of unique words – during the first call to .build_vocab() on the original corpus.
When you train with additional examples that have either words or tags that aren't already known to the model, those words or tags are simply ignored.
(The .build_vocab(…, update=True) option that's available on Word2Vec to expand its vocabulary has never been fully applied to Doc2Vec, either with respect to tags or with respect to a longstanding crashing bug. So it's not supported on Doc2Vec.)
Note that if it is your aim to create document-vectors that assist in some downstream-classification task, you may not want to supply your known-labels as tags, or at least not as a document's only tag.
The tags you supply to Doc2Vec are the units for which it learns vectors. If you have a million text examples, but only 5 different labels, if you feed those million examples into training each with only the label as a tag, the model is only learning 5 doc-vectors. It is, essentially, like you're training on only 5 mega-documents (passed in in chunks) – and thus 'summarizing' each label down to a single point in vector-space, when it might be far more useful to think of a label as covering a irregularly-shaped "point cloud".
So, you might instead want to use document-IDs rather than labels. (Or, labels and document-IDs.) Then, use the many varied vectors from all individual documents – rather than single vectors per label – to train some downstream classifier or clusterer.
And in that case, the arrival of documents with new labels might not require a full Doc2Vec-retraining. Instead, if the new documents still get useful vectors from inference on the older Doc2Vec model, those per-doc vectors may reflect enough about the new label's documents that downstream classifiers can learn to recognize them.
Ultiamtely, though, if you acquire much more training data, reflecting all new vocabularies & word-senses, the safest approach is to retrain a Doc2Vec model from scratch, using all data. Simply incremental training, even if it had official support, risks pulling those words/tags that appear in new data arbitrarily out-of-comparable-alignment with words/tags that were only trained in the original dataset. It is the interleaved co-training, alongside all other examples equally, which pushes-and-pulls all vectors in a model into useful relative arrangements.

Finding both target and center word2vec matrices

I've read and heard(In the CS224 of Stanford) that the Word2Vec algorithm actually trains two matrices(that is, two sets of vectors.) These two are the U and the V set, one for words being a target and one for words being the context. The final output is the average of these two.
I have two questions in mind. one is that:
Why do we get an average of two vectors? Why it makes sense? Don't we lose some information?
The second question is, using pre-trained word2vec models, how can I get access to both matrices? Is there any downloadable word2vec with both sets of vectors? I don't have enough resources to train a new one.
Thanks
That relayed description isn't quite right. The word-vectors traditionally retrieved from a word2vec model come from a "projection matrix" which converts individual words to a right-sized input-vector for the shallow neural network.
(You could think of the projection matrix as turning a one-hot encoding into a dense-embedding for that word, but libraries typically implement this via a dictionary-lookup – eg: "what row of the vectors-matrix should I consult for this word-token?")
There's another matrix of weights leading to the model's output nodes, whose interpretation varies based on the training mode. In the common default of negative-sampling, there's one node per known word, so you could also interpret this matrix as having a vector per word. (In hierarchical-softmax mode, the known-words aren't encoded as single output nodes, so it's harder to interpret the relationship of this matrix to individual words.)
However, this second vector per word is rarely made directly available by libraries. Most commonly, the word-vector is considered simply the trained-up input vector, from the projection matrix. For example, the export format from Google's original word2vec.c release only saves-out those vectors, and the large "GoogleNews" vector set they released only has those vectors. (There's no averaging with the other output-side representation.)
Some work, especially that of Mitra et all of Microsoft Research (in "Dual Embedding Space Models" & associated writeups) has noted those output-side vectors may be of value in some applications as well – but I haven't seen much other work using those vectors. (And, even in that work, they're not averaged with the traditional vectors, but consulted as a separate option for some purposes.)
You'd have to look at the code of whichever libraries you're using to see if you can fetch these from their full post-training model representation. In the Python gensim library, this second matrix in the negative-sampling case is a model property named syn1neg, following the naming of the original word2vec.c.

What is the stochastic aspect of Word2Vec?

I'm vectorizing words on a few different corpora with Gensim and am getting results that are making me rethink how Word2Vec functions. My understanding was that Word2Vec was deterministic, and that the position of a word in a vector space would not change from training to training. If "My cat is running" and "your dog can't be running" are the two sentences in the corpus, then the value of "running" (or its stem) seems necessarily fixed.
However, I've found that that value indeed does vary across models, and words keep changing where they are on a vector space when I train the model. The differences are not always hugely meaningful, but they do indicate the existence of some random process. What am I missing here?
This is well-covered in the Gensim FAQ, which I quote here:
Q11: I've trained my Word2Vec/Doc2Vec/etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)
Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization
during training. (For example, the training windows are randomly
truncated as an efficient way of weighting nearer words higher. The
negative examples in the default negative-sampling mode are chosen
randomly. And the downsampling of highly-frequent words, as controlled
by the sample parameter, is driven by random choices. These
behaviors were all defined in the original Word2Vec paper's algorithm
description.)
Even when all this randomness comes from a
pseudorandom-number-generator that's been seeded to give a
reproducible stream of random numbers (which gensim does by default),
the usual case of multi-threaded training can further change the exact
training-order of text examples, and thus the final model state.
(Further, in Python 3.x, the hashing of strings is randomized each
re-launch of the Python interpreter - changing the iteration ordering
of vocabulary dicts from run to run, and thus making even the same
string-of-random-number-draws pick different words in different
launches.)
So, it is to be expected that models vary from run to run, even
trained on the same data. There's no single "right place" for any
word-vector or doc-vector to wind up: just positions that are at
progressively more-useful distances & directions from other vectors
co-trained inside the same model. (In general, only vectors that were
trained together in an interleaved session of contrasting uses become
comparable in their coordinates.)
Suitable training parameters should yield models that are roughly as
useful, from run-to-run, as each other. Testing and evaluation
processes should be tolerant of any shifts in vector positions, and of
small "jitter" in the overall utility of models, that arises from the
inherent algorithm randomness. (If the observed quality from
run-to-run varies a lot, there may be other problems: too little data,
poorly-tuned parameters, or errors/weaknesses in the evaluation
method.)
You can try to force determinism, by using workers=1 to limit
training to a single thread – and, if in Python 3.x, using the
PYTHONHASHSEED environment variable to disable its usual string hash
randomization. But training will be much slower than with more
threads. And, you'd be obscuring the inherent
randomness/approximateness of the underlying algorithms, in a way that
might make results more fragile and dependent on the luck of a
particular setup. It's better to tolerate a little jitter, and use
excessive jitter as an indicator of problems elsewhere in the data or
model setup – rather than impose a superficial determinism.
While I don't know any implementation details of Word2Vec in gensim, I do know that, in general, Word2Vec is trained by a simple neural network with an embedding layer as the first layer. The weight matrix of this embedding layer contains the word vectors that we are interested in.
This being said, it is in general also quite common to initialize the weights of a neural network randomly. So there you have the origin of your randomness.
But how can the results be different, regardless of different (random) starting conditions?
A well trained model will assign similar vectors to words that have similar meaning. This similarity is measured by the cosine of the angle between the two vectors. Mathematically speaking, if v and w are the vectors of two very similar words then
dot(v, w) / (len(v) * len(w)) # this formula gives you the cosine of the angle between v and w
will be close to 1.
Also, it will allow you to do arithmetics like the famous
king - man + woman = queen
For illustration purposes imagine 2D-vectors. Would these arithmetical properties get lost if you e.g. rotate everything by some angle around the origin? With a little mathematical background I can assure you: No, they won't!
So, your assumption
If "My cat is running" and "your dog can't be running" are the two
sentences in the corpus, then the value of "running" (or its stem)
seems necessarily fixed.
is wrong. The value of "running" is not fixed at all. What is (somehow) fixed, however, is the similarity (cosine) and arithmetical relationship to other words.

Resources