Setting max length of char n-grams for fastText

I want to compare word2vec and fastText models based on this comparison tutorial:
https://github.com/jayantj/gensim/blob/fast_text_notebook/docs/notebooks/Word2Vec_FastText_Comparison.ipynb
According to this, the semantic accuracy of the fastText model increases when we set the max length of char n-grams to zero, so that fastText starts to behave almost like word2vec: it ignores the n-grams.
However, I cannot find any information on how to set this parameter while loading a fastText model. Any ideas on how to do this?

The parameter is set at training time – the model is built using that parameter, and interpreting the trained model depends on that parameter. So you wouldn't typically change it when loading an already-trained model, and there's no API in gensim (or the original FastText) to change the setting on an already-trained model.
(By looking at the source and tampering with the loaded model state directly, you might be able to approximate the effect of ignoring char-ngrams that had been trained – but that'd be a novel mode, not at all like the no-ngrams-trained mode evaluated in the notebook you've linked. It might generate interesting, or awful, results – no way to tell without trying it.)
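For reference, the no-ngrams mode is chosen at training time, not at load time. A minimal sketch, assuming gensim 4.x parameter names and a placeholder corpus:

from gensim.models import FastText

sentences = [["first", "example", "sentence"], ["second", "example", "sentence"]]

# max_n=0 disables character n-grams entirely, so no subword vectors are
# trained and the model behaves much like plain word2vec.
model = FastText(sentences, vector_size=100, window=5, min_count=1, min_n=3, max_n=0, epochs=10)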

Related

Having trouble training Word2Vec iteratively on Gensim

I'm attempting to iteratively train a model on multiple texts I supply myself. However, I keep running into an issue when I train the model more than once:
ValueError: You must specify either total_examples or total_words, for proper learning-rate and progress calculations. If you've just built the vocabulary using the same corpus, using the count cached in the model is sufficient: total_examples=model.corpus_count.
I'm currently initiating my model like this:
model = Word2Vec(sentences, min_count=0, workers=cpu_count())
model.build_vocab(sentences, update=False)
model.save('firstmodel.model')
model = Word2Vec.load('firstmodel.model')
and subsequently training it iteratively like this:
model.build_vocab(sentences, update = True)
model.train(sentences, totalexamples=model.corpus_count, epochs=model.epochs)
What am I missing here?
Somehow it worked when I trained just one additional model, so I'm not sure why it doesn't work beyond two models...
First, the error message says you need to supply either the total_examples or total_words parameter to train() (so that it has an accurate estimate of the total training-corpus size).
Your code, as currently shown, only supplies totalexamples – a parameter name missing the necessary _. Correcting this typo should remedy the immediate error.
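With that fix, the training call would look like this (a sketch of your own snippet with only the parameter name corrected):

model.build_vocab(sentences, update=True)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)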
However, some other comments on your usage:
repeatedly calling train() with different data is an expert technique highly subject to error or other problems. It's not the usual way of using Word2Vec, nor the way most published results were reached. You can't count on it to always improve the model with new words; it might make the model worse, as new training sessions update some-but-not-all words, and alter the (usual) property that the vocabulary has one consistent set of word-frequencies from one single corpus. The best course is to train() once, with all available data, so that the full vocabulary, word-frequencies, & equally-trained word-vectors are achieved in a single consistent session.
min_count=0 is almost always a bad idea with word2vec: words with few examples in the corpus should be discarded. Trying to learn word-vectors for them not only gets weak vectors for those words, but dilutes/distracts the model from achieving better vectors for surrounding more-common words.
a count of workers up to your local cpu_count() only reliably helps up to about 4-12 workers, depending on other parameters & the efficiency of your corpus-reading; beyond that, more workers can hurt, due to inefficiencies in the Python GIL & Gensim corpus-to-worker handoffs. Finding the actual best count for your setup is, unfortunately, still just a matter of trial and error. But if you've got 16 (or more) cores, your setting is almost sure to do worse than a lower workers number.

Understanding the role of the function build_vocab in Doc2Vec

I have recently started studying Doc2Vec model.
I have understood its mechanism and how it works.
I'm trying to implement it using gensim framework.
I have transformed my training data into TaggedDocument.
But I have one question:
What is the role of this line: model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])?
Is it to create random vectors that represent the text?
Thank you for your help
The Doc2Vec model needs to know several things about the training corpus before it is fully allocated & initialized.
First & foremost, the model needs to know the words present & their frequencies – a working vocabulary – so that it can determine the words that will remain after the min_count floor is applied, and allocate/initialize word-vectors & internal model structures for the relevant words. The word-frequencies will also be used to influence the random sampling of negative-word-examples (for the default negative-sampling mode) and the downsampling of very-frequent words (per the sample parameter).
Additionally, the model needs to know the rough size of the overall training set in order to gradually decrement the internal alpha learning-rate over the course of each epoch, and give meaningful progress-estimates in logging output.
At the end of build_vocab(), all memory/objects needed for the model have been created. Per the needs of the underlying algorithm, all vectors will have been initialized to low-magnitude random vectors to ready the model for training. (It essentially won't use any more memory, internally, through training.)
Also, after build_vocab(), the vocabulary is frozen: any words presented during training (or later inference) that aren't already in the model will be ignored.
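A minimal sketch of where build_vocab() fits in a typical gensim workflow (names and parameter values here are illustrative only):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

train_tagged = [
    TaggedDocument(words=["first", "example", "document"], tags=[0]),
    TaggedDocument(words=["second", "example", "document"], tags=[1]),
]

model_dbow = Doc2Vec(dm=0, vector_size=100, min_count=1, epochs=30)
model_dbow.build_vocab(train_tagged)   # scan corpus: vocabulary, frequencies, allocation & initialization
model_dbow.train(train_tagged, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs)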

Use pretrained embedding in Spanish with Torchtext

I am using Torchtext in an NLP project. I have a pretrained embedding in my system, which I'd like to use. Therefore, I tried:
my_field.vocab.load_vectors(my_path)
But, apparently, this only accepts the names of a short list of pre-accepted embeddings, for some reason. In particular, I get this error:
Got string input vector "my_path", but allowed pretrained vectors are ['charngram.100d', 'fasttext.en.300d', ..., 'glove.6B.300d']
I found some people with similar problems, but the solutions I can find so far are "change Torchtext source code", which I would rather avoid if at all possible.
Is there any other way in which I can work with my pretrained embedding? A solution that allows using another Spanish pretrained embedding would also be acceptable.
Some people seem to think it is not clear what I am asking. So, if the title and final question are not enough: "I need help using a pre-trained Spanish word-embedding in Torchtext".
It turns out there is a relatively simple way to do this without changing Torchtext's source code. Inspiration from this Github thread.
1. Create numpy word-vector tensor
You need to load your embedding so you end up with a numpy array with dimensions (number_of_words, word_vector_length):
my_vecs_array[word_index] should return your corresponding word vector.
IMPORTANT. The indices (word_index) for this array MUST be taken from Torchtext's word-to-index dictionary (field.vocab.stoi). Otherwise Torchtext will point to the wrong vectors!
Don't forget to convert to tensor:
my_vecs_tensor = torch.from_numpy(my_vecs_array)
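Putting step 1 together, a rough sketch – assuming your Spanish embedding can be loaded with gensim as KeyedVectors, that my_field.build_vocab(...) has already been run, and that the file name is hypothetical:

import numpy as np
import torch
from gensim.models import KeyedVectors

pretrained = KeyedVectors.load_word2vec_format("my_spanish_vectors.vec")  # hypothetical path
word_vector_length = pretrained.vector_size

my_vecs_array = np.zeros((len(my_field.vocab.stoi), word_vector_length), dtype=np.float32)
for word, word_index in my_field.vocab.stoi.items():
    if word in pretrained:               # words missing from the embedding stay all-zero
        my_vecs_array[word_index] = pretrained[word]

my_vecs_tensor = torch.from_numpy(my_vecs_array)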
2. Load array to Torchtext
I don't think this step is really necessary because of the next one, but it allows you to keep the Torchtext field with both the dictionary and the vectors in one place.
my_field.vocab.set_vectors(my_field.vocab.stoi, my_vecs_tensor, word_vectors_length)
3. Pass weights to model
In your model you will declare the embedding like this:
my_embedding = torch.nn.Embedding(vocab_len, word_vect_len)
Then you can load your weights using:
my_embedding.weight = torch.nn.Parameter(my_field.vocab.vectors, requires_grad=False)
Use requires_grad=True if you want to train the embedding, use False if you want to freeze it.
EDIT: It turns out there is another way that is a bit easier! The improvement is that apparently you can pass the pre-trained word vectors directly during the vocabulary-building step, which takes care of steps 1-2 here.
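If you want to try that route, a sketch using the (legacy Field API) torchtext.vocab.Vectors loader might look like this – the file name, cache directory, and dataset variable are hypothetical:

import torch
from torchtext.vocab import Vectors

spanish_vectors = Vectors(name="my_spanish_vectors.vec", cache="./vector_cache")
my_field.build_vocab(my_dataset, vectors=spanish_vectors)

# my_field.vocab.vectors is now aligned with my_field.vocab.stoi, so it can be
# passed straight to the model's embedding layer.
my_embedding = torch.nn.Embedding.from_pretrained(my_field.vocab.vectors, freeze=True)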

semantic and syntactic performance of Doc2vec model

I am trying to check the semantic and syntactic performance of a doc2vec model with doc2vec_model.accuracy(questions-words), but it doesn't seem to work, since models.deprecated.doc2vec – Deep learning with paragraph2vec – says it has been deprecated since version 3.3.0 of the gensim package. It gives this error message:
AttributeError: 'Doc2Vec' object has no attribute 'accuracy'
Though it works well with a word2vec model, is there any way I can get this done other than with doc2vec_model.accuracy(questions-words), or is it impossible?
A few notes:
That 'accuracy()' test is only a test of word-vectors on analogy problems – an easy evaluation to run, used in a number of papers, but not the final authority on whether a set of word-vectors is better than others for a particular purpose. (When I've had a project-specific scoring method, sometimes the word-vectors that score best on project-specific goals don't score best on those analogies – especially if the word-vectors are being used for a classification or information-retrieval task.)
Further, the popular and fast PV-DBOW Doc2Vec mode (dm=0 in gensim) doesn't train word-vectors at all, unless you add another setting (dbow_words=1). Such untrained word-vectors will be in random locations, scoring awfully on the analogies-accuracy.
But, using either PV-DM (dm=1) mode, or adding dbow_words=1 to PV-DBOW, will get word-vectors from Doc2Vec, and you might still want to run the analogies test. Fortunately, analogy-evaluation options have been retained & even expanded on the KeyedVectors object that's held in the Doc2Vec wv property. You can call the old accuracy() method there:
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors.accuracy
But there's also a slightly-different scoring evaluate_word_pairs():
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.evaluate_word_pairs
(And in the 4.0.0 release there'll be an evaluate_word_analogies() which replaces accuracy().)
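So, on a model where word-vectors were actually trained (dm=1, or dbow_words=1), a rough sketch – using the evaluation files bundled with gensim's test data – would be:

from gensim.test.utils import datapath

# gensim 3.x method name; in 4.0+ use doc2vec_model.wv.evaluate_word_analogies(...) instead
analogy_results = doc2vec_model.wv.accuracy(datapath('questions-words.txt'))

# the slightly different word-similarity evaluation
word_pair_results = doc2vec_model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))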

load Doc2Vec model and get new sentence's vectors for test

I have read lots of examples regarding doc2vec, but I couldn't find an answer. As a real example, I want to build a model with doc2vec and then train some ML models with it. After that, how can I get the vector of a raw string from the exact same trained Doc2Vec model? I need to feed my ML model a vector of the same size and meaning.
There's a collection of example Jupyter (aka IPython) notebooks in the gensim docs/notebooks directory. You can view them online at:
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
But they'll be in your gensim installation directory, if you can find that for your current working environment.
Those that include doc2vec in their name demonstrate the use of the Doc2Vec class. The most basic intro operates on the 'Lee' corpus that's bundled with gensim for use in its unit tests. (It's really too small for real Doc2Vec success, but by forcing smaller models and many training iterations the notebook just barely manages to get some consistent results.) See:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
It includes a section on inferring a vector for a new text.
Note that inference is performed on a list of string tokens, not a raw string. And those tokens should have been preprocessed/tokenized the same way as the original training data for the model, so that the vocabularies are compatible. (Any unknown words in a new text are silently ignored.)
Note also that especially on short texts, it often helps to provide a much-larger-than-default value of the optional steps parameter to infer_vector() - say 50 or 200 rather than the default 5. It may also help to provide a starting alpha parameter more like the training default of 0.025 than the method-default of 0.1.
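A minimal sketch of that inference step, assuming a previously saved model and using gensim's simple_preprocess as a stand-in for whatever tokenization was used at training time (in gensim 4.x the steps parameter is named epochs):

from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

model = Doc2Vec.load("my_doc2vec.model")                  # hypothetical saved-model path
tokens = simple_preprocess("a raw string to classify")    # must match training-time preprocessing
vector = model.infer_vector(tokens, steps=50, alpha=0.025)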
