Gensim's Word2Vec not training provided documents - python-3.x

I'm facing a training problem with Gensim's Word2Vec.
model.wv.vocab is not picking up any new words from the corpus passed to train();
the only words in it are the ones from the corpus given at initialization.
In fact, after many attempts with my own code, even the official site's example didn't work.
I tried saving the model at many points in my code.
I even tried saving and reloading the corpus alongside the train() call.
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
path = get_tmpfile("word2vec.model")
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
print(len(model.wv.vocab))
model.train([["hello", "world"]], total_examples=1, epochs=1)
model.save("word2vec.model")
print(len(model.wv.vocab))
The first print statement gives 12, which is right.
The second also gives 12, when it's supposed to give 14 (the original vocab plus 'hello' and 'world').

Additional calls to train() don't expand the known vocabulary. So, there is no way that the value of len(model.wv.vocab) will change after another call to train(). (Either 'hello' and 'world' are already known to the model, in which case they were in the original count of 12, or they weren't known, in which case they were ignored.)
The vocabulary is only established during a specific build_vocab() phase, which happens automatically if, as your code shows, you supplied a training corpus (common_texts) in model instantiation.
You can use a call to build_vocab() with the optional added parameter update=True to incrementally update a model's vocabulary, but this is best considered an advanced/experimental technique that introduces added complexities. (Whether such vocab-expansion, and then followup incremental training, actually helps or hurts will depend on getting a lot of other murky choices about alpha, epochs, corpus-sizing, training modes, and corpus-contents correct.)
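To make the distinction concrete, here is a toy stand-in written in plain Python. It is not Gensim's implementation, and the class name ToyVocabModel and its internals are invented for illustration. It only mimics the behaviour described above: the vocabulary is fixed by build_vocab(), train() skips unknown words, and build_vocab(..., update=True) is the only step that adds words.

```python
class ToyVocabModel:
    """Toy stand-in, NOT Gensim: the vocabulary only changes in build_vocab()."""

    def __init__(self, min_count=1):
        self.min_count = min_count
        self.counts = {}
        self.vocab = set()

    def build_vocab(self, sentences, update=False):
        # Without update=True, start the frequency survey from scratch.
        if not update:
            self.counts = {}
        for sentence in sentences:
            for word in sentence:
                self.counts[word] = self.counts.get(word, 0) + 1
        self.vocab = {w for w, c in self.counts.items() if c >= self.min_count}

    def train(self, sentences):
        # Returns how many tokens were usable; unknown words are skipped.
        return sum(1 for s in sentences for w in s if w in self.vocab)


model = ToyVocabModel(min_count=1)
model.build_vocab([["human", "interface", "computer"], ["graph", "trees"]])
size_before = len(model.vocab)

# Both words are unknown, so train() uses nothing and the vocab is unchanged.
used = model.train([["hello", "world"]])
size_after_train = len(model.vocab)

# Only an explicit vocabulary update adds the new words.
model.build_vocab([["hello", "world"]], update=True)
size_after_update = len(model.vocab)
```

In real Gensim code the corresponding calls would be model.build_vocab(new_sentences, update=True) followed by model.train(...), with all the caveats mentioned above.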

Related

Having trouble training Word2Vec iteratively on Gensim

I'm attempting to train multiple texts supplied by myself iteratively. However, I keep running into an issue when I train the model more than once:
ValueError: You must specify either total_examples or total_words, for proper learning-rate and progress calculations. If you've just built the vocabulary using the same corpus, using the count cached in the model is sufficient: total_examples=model.corpus_count.
I'm currently initiating my model like this:
model = Word2Vec(sentences, min_count=0, workers=cpu_count())
model.build_vocab(sentences, update=False)
model.save('firstmodel.model')
model = Word2Vec.load('firstmodel.model')
and subsequently training it iteratively like this:
model.build_vocab(sentences, update = True)
model.train(sentences, totalexamples=model.corpus_count, epochs=model.epochs)
What am I missing here?
Somehow, it worked when I just trained one other model, so not sure why it doesn't work beyond two models...
First, the error message says you need to supply either the total_examples or total_words parameter to train() (so that it has an accurate estimate of the total training-corpus size).
Your code, as currently shown, only supplies totalexamples – a parameter name missing the necessary _. Correcting this typo should remedy the immediate error.
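As a general Python illustration of how a misspelled keyword can surface as a "missing parameter" error: if a function accepts arbitrary extra keyword arguments, the typo is quietly collected there and the real parameter keeps its default. The function toy_train below is hypothetical, and whether Gensim's train() behaves this way in your version is an assumption, not something the error message confirms.

```python
def toy_train(sentences, total_examples=None, total_words=None, **kwargs):
    # Hypothetical function, not Gensim's train(): unknown keywords such as
    # `totalexamples` end up in kwargs instead of raising a TypeError.
    if total_examples is None and total_words is None:
        raise ValueError("You must specify either total_examples or total_words")
    return total_examples if total_examples is not None else total_words


sentences = [["hello", "world"]]

try:
    toy_train(sentences, totalexamples=len(sentences))  # misspelled keyword
    outcome = "ok"
except ValueError:
    outcome = "ValueError"

# With the underscore in place the value reaches the intended parameter.
fixed = toy_train(sentences, total_examples=len(sentences))
```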
However, some other comments on your usage:
repeatedly calling train() with different data is an expert technique highly subject to error or other problems. It's not the usual way of using Word2Vec, nor the way most published results were reached. You can't count on it to always improve the model with new words; it might make the model worse, as new training sessions update some-but-not-all words, and alter the (usual) property that the vocabulary has one consistent set of word-frequencies from one single corpus. The best course is to train() once, with all available data, so that the full vocabulary, word-frequencies, & equally-trained word-vectors are achieved in a single consistent session.
min_count=0 is almost always a bad idea with word2vec: words with few examples in the corpus should be discarded. Trying to learn word-vectors for them not only gets weak vectors for those words, but dilutes/distracts the model from achieving better vectors for surrounding more-common words.
a count of workers up to your local cpu_count() only reliably helps up to about 4-12 workers, depending on other parameters & the efficiency of your corpus-reading; beyond that, more workers can hurt, due to inefficiencies in the Python GIL & Gensim corpus-to-worker handoffs. (Finding the actual best count for your setup is, unfortunately, still just a matter of trial and error. But if you've got 16 (or more) cores, your setting is almost sure to do worse than a lower workers number.)
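To illustrate the first two points, here is a small plain-Python sketch (not Gensim code; the helper surviving_vocab and the sample sentences are made up) of surveying all available texts in one pass and applying a min_count threshold, so that every word gets one consistent frequency and rare words are dropped.

```python
from collections import Counter


def surviving_vocab(corpora, min_count):
    # One frequency survey over ALL corpora, then discard rare words.
    counts = Counter(
        word for corpus in corpora for sentence in corpus for word in sentence
    )
    return {word: n for word, n in counts.items() if n >= min_count}


old_texts = [["the", "cat", "sat"], ["the", "dog", "sat"]]
new_texts = [["the", "cat", "ran"], ["a", "dog", "ran"]]

# Combining both corpora before counting gives each word a single frequency.
kept = surviving_vocab([old_texts, new_texts], min_count=2)
```

Here the word "a" appears only once, so it is discarded, while the remaining words keep counts taken over the combined data.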

Custom word-embeddings in gensim

I have a word embedding matrix (say M) obtained of order V x N where V is the size of the vocabulary and N is the size of each word vector. I want the word2vec model of gensim to initialise its word embedding matrix with M, during training. I am able to load M in the word2vec format using
gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)
but I don't know how to feed M into the gensim word2vec model.
The Gensim Word2Vec model isn't designed to be pre-initialized with outside vectors, so there's no built-in helper methods.
But of course since the source is available, and all objects can be modified by your direct tampering, you can still change/replace its usual (random) initialization with anything of your own choosing, with a little effort.
Specifically, you'd first create a Word2Vec model using the constructor without yet supplying any training corpus. (If you supply a corpus, it will automatically do the next two build_vocab() and train() steps for you, and you don't want that.)
Then, you'd perform the necessary .build_vocab() step, allowing it to survey your training text data to discover its vocabulary with word-frequencies, and perform its usual model initialization:
model.build_vocab(corpus)
At this point, before doing any other training, you can tamper with the model to replace its random-initialization of words with your alternate word-vectors, as from the vectors you've loaded. If your other vectors are in the KeyedVectors variable loaded_kv, this could be as simple as:
for word in loaded_kv.index_to_key:
    if word in model.wv:
        # overwrite the random initialization for words the model already knows
        model.wv[word] = loaded_kv[word]
Note that if your loaded_kv includes words that aren't in the corpus, or too rare (appear fewer than min_count times) in the corpus, the model will not have allocated a space for those vectors – as they won't be used in training – and they won't be part of the final model.
If for some reason you need them to be, you should ensure a sufficient number of valid usage examples of those words appear inside the corpus. You shouldn't add them to the model in a way that changes the total number of vectors in model.wv after .build_vocab(), because the model is not expecting that sort of change, and errors/undefined-behavior are likely.
(You also shouldn't simply force extra words that aren't in the real training data into the model, because while they will remain unchanged through training, all other words will continue to be adjusted through training – meaning any words that weren't also incrementally adjusted, in an interleaved fashion, along with the rest may wind up essentially "incompatible" with the word-vectors that were co-trained in the same full model.)
After you've modified the model's initialization to match your preferences, continue with normal training, something like:
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
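The rule about only overwriting slots the model already allocated can be sketched with plain dictionaries standing in for the two vector stores. This is a toy, not Gensim code: seed_vectors and the sample vectors are hypothetical, and the dimension check is an extra safeguard added for illustration.

```python
def seed_vectors(model_vectors, loaded_vectors):
    # Overwrite only words the model already knows; never add new keys.
    replaced = 0
    for word, vector in loaded_vectors.items():
        if word not in model_vectors:
            continue  # not in the surveyed corpus (or too rare): skip it
        if len(vector) != len(model_vectors[word]):
            raise ValueError(f"dimension mismatch for {word!r}")
        model_vectors[word] = list(vector)
        replaced += 1
    return replaced


# Stand-in for the model after build_vocab(): two allocated slots.
model_vectors = {"cat": [0.0, 0.0], "dog": [0.0, 0.0]}
# Stand-in for the loaded vectors: "zebra" is unknown to the model.
loaded_vectors = {"cat": [0.5, -0.5], "zebra": [1.0, 1.0]}

replaced = seed_vectors(model_vectors, loaded_vectors)
```

After the call the model still holds exactly the words it had before, which is the property the training step relies on.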

When doing pre-training of a transformer model, how can I add words to the vocabulary?

Given a DistilBERT trained language model for a given language, taken from the Huggingface hub, I want to pre-train the model on a specific domain, and I want to add new words that are:
definitely not present in the original training set
and impossible to handle via word piece tokenization - basically you can think of these words as "codes" that are a normalized form of a named entity
Consider that:
I would like to avoid learning a new tokenizer: I am fine with adding the new words and then letting the model learn their embeddings via pre-training
the number of the "words" is way larger than the number of "unused" tokens in the "stock" vocabulary
The only advice that I have found is the one reported here:
Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but but with a bigger vocab where the new embeddings are randomly initialized (for initialized we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
Do you think this is the only way to achieve my goal?
If yes, I do not have any idea how to write this "script": does anyone have hints on how to proceed (sample code, documentation etc)?
As per my comment, I'm assuming that you go with a pre-trained checkpoint, if only to "avoid [learning] a new tokenizer."
Also, the solution works with PyTorch, which might be more suitable for such changes. I haven't checked Tensorflow (which is mentioned in one of your quotes), so no guarantees that this works across platforms.
To solve your problem, let us divide this into two sub-problems:
Adding the new tokens to the tokenizer, and
Re-sizing the token embedding matrix of the model accordingly.
The first can actually be achieved quite simply by using .add_tokens(). I'm referencing the slow tokenizer's implementation of it (because it's in Python), but from what I can see, this also exists for the faster Rust-based tokenizers.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Will return an integer corresponding to the number of added tokens
# The input could also be a list of strings instead of a single string
num_new_tokens = tokenizer.add_tokens("dennlinger")
You can quickly verify that this worked by looking at the encoded input ids:
print(tokenizer("This is dennlinger."))
# 'input_ids': [101, 2023, 2003, 30522, 1012, 102]
The index 30522 now corresponds to the new token with my username, so we can check the first part. However, if we look at the function docstring of .add_tokens(), it also says:
Note, when adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the model so that its embedding matrix matches the tokenizer.
In order to do that, please use the PreTrainedModel.resize_token_embeddings method.
Looking at this particular function, the description is a bit confusing, but we can get a correctly resized matrix (with randomly initialized weights for new tokens), by simply passing the previous model size, plus the number of new tokens:
from transformers import AutoModel
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.resize_token_embeddings(model.config.vocab_size + num_new_tokens)
# Test that everything worked correctly
model(**tokenizer("This is dennlinger", return_tensors="pt"))
EDIT: Notably, .resize_token_embeddings() also takes care of any associated weights; this means, if you are pre-training, it will also adjust the size of the language modeling head (which should have the same number of tokens), or fix tied weights that would be affected by an increased number of tokens.
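Conceptually, resizing amounts to keeping the existing rows and appending freshly initialized ones. The sketch below shows that idea with plain Python lists. It is not what transformers does internally: resize_embeddings is a made-up helper, and it draws from an ordinary normal distribution with the stddev of 0.02 mentioned in the quoted advice, rather than a truncated one.

```python
import random


def resize_embeddings(matrix, new_size, stddev=0.02, seed=0):
    # Keep existing rows untouched; append randomly initialized rows.
    rng = random.Random(seed)
    dim = len(matrix[0])
    resized = [row[:] for row in matrix[:new_size]]
    while len(resized) < new_size:
        resized.append([rng.gauss(0.0, stddev) for _ in range(dim)])
    return resized


# Two "pre-trained" rows of dimension 3, plus one newly added token.
old = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
new = resize_embeddings(old, len(old) + 1)
```

The appended row only becomes meaningful once further pre-training has adjusted it.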

load Doc2Vec model and get new sentence's vectors for test

I have read lots of examples regarding doc2vec, but I couldn't find an answer. As a realistic example, I want to build a Doc2Vec model and then use its vectors to train some ML models. After that, how can I get the vector of a raw string from the exact same trained Doc2Vec model? I need it because my ML model must predict on vectors of the same size and meaning.
There is a collection of example Jupyter (aka IPython) notebooks in the gensim docs/notebooks directory. You can view them online at:
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
But they'll be in your gensim installation directory, if you can find that for your current working environment.
Those that include doc2vec in their name demonstrate the use of the Doc2Vec class. The most basic intro operates on the 'Lee' corpus that's bundled with gensim for use in its unit tests. (It's really too small for real Doc2Vec success, but by forcing smaller models and many training iterations the notebook just barely manages to get some consistent results.) See:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
It includes a section on inferring a vector for a new text:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
Note that inference is performed on a list of string tokens, not a raw string. And those tokens should have been preprocessed/tokenized the same way as the original training data for the model, so that the vocabularies are compatible. (Any unknown words in a new text are silently ignored.)
Note also that, especially on short texts, it often helps to provide a much-larger-than-default value of the optional steps parameter to infer_vector() (named epochs in Gensim 4.x), say 50 or 200 rather than the default 5. It may also help to provide a starting alpha parameter more like the training default of 0.025 than the method-default of 0.1.
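A small plain-Python sketch of the preprocessing point: the same tokenizer is applied to training texts and to new texts, and a raw string passed where a token list is expected is iterated character by character. The tokenize function and the sample sentences are hypothetical; use whatever preprocessing your training data actually went through.

```python
import re


def tokenize(text):
    # Hypothetical preprocessing: lowercase, keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


training_docs = ["The cat sat on the mat.", "Dogs chase cats."]
vocab = {tok for doc in training_docs for tok in tokenize(doc)}

raw = "The cat chased a bird!"
tokens = tokenize(raw)                      # what inference should receive
known = [t for t in tokens if t in vocab]   # words the model could use

# Iterating a raw string yields single characters, not words.
chars = list(raw)[:3]
```

With a trained model, the list in tokens is what would be handed to infer_vector(); only the entries in known would contribute anything.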

Training a CNN with pre-trained word embeddings is very slow (TensorFlow)

I'm using TensorFlow (0.6) to train a CNN on text data. I'm using a method similar to the second option specified in this SO thread (with the exception that the embeddings are trainable). My dataset is pretty small and the vocabulary is around 12,000 words. When I train using random word embeddings everything works nicely. However, when I switch to the pre-trained embeddings from the word2vec site, the vocabulary grows to over 3,000,000 words and training iterations become over 100 times slower. I'm also seeing this warning:
UserWarning: Converting sparse IndexedSlices to a dense Tensor with
900482700 elements
I saw the discussion on this TensorFlow issue, but I'm still not sure if the slowdown I'm experiencing is expected or if it's a bug. I'm using the Adam optimizer but it's pretty much the same thing with Adagrad.
One workaround I guess I could try is to train using a minimal embedding matrix with only the ~12,000 words in my dataset, serialize the resulting embeddings and at runtime merge them with the remaining words from the pre-trained embeddings. I think this should work but it sounds hacky.
Is that currently the best solution or am I missing something?
So there were two issues here:
As mrry pointed out in his comment to the question, the warning was not a result of a conversion during the updates. Rather, I was calculating summary statistics (sparsity and histogram) on the embeddings gradient and that caused the conversion.
Interestingly, removing the summaries made the message go away, but the code remained slow. Per the TensorFlow issue referenced in the question, I had to also replace the AdamOptimizer with the AdagradOptimizer and once I did that the runtime was back on par with the one obtained from a small vocabulary.
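For reference, the workaround floated in the question (keeping only the rows for words that occur in the dataset) could look roughly like the plain-Python sketch below. The helper restrict_embeddings, the sample vectors and the uniform fallback range are all made up for illustration. The last line checks the size in the warning: 900482700 elements would match 3,001,609 rows of 300 values each, assuming 300-dimensional vectors, which the post does not state.

```python
import random


def restrict_embeddings(pretrained, dataset_vocab, dim, seed=0):
    # Build a small matrix holding only the words that occur in the dataset.
    rng = random.Random(seed)
    index, matrix = {}, []
    for word in sorted(dataset_vocab):
        vector = pretrained.get(word)
        if vector is None:
            # Word has no pre-trained vector: arbitrary small random init.
            vector = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
        index[word] = len(matrix)
        matrix.append(list(vector))
    return index, matrix


pretrained = {"cat": [0.1, 0.2], "dog": [0.3, 0.4], "zebra": [0.5, 0.6]}
index, matrix = restrict_embeddings(pretrained, {"cat", "dog", "purr"}, dim=2)

# Size of a dense matrix with 3,001,609 rows and 300 columns (assumed dim).
full_elements = 3_001_609 * 300
```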

Resources