semantic and syntactic performance of a Doc2Vec model

I am trying to check the semantic and syntactic performance of a Doc2Vec model with doc2vec_model.accuracy(questions-words), but it doesn't seem to work: the models.deprecated.doc2vec – Deep learning with paragraph2vec page says it has been deprecated since version 3.3.0 of the gensim package. It gives this error message:
AttributeError: 'Doc2Vec' object has no attribute 'accuracy'
Though it works well with a Word2Vec model, is there any way I can do this other than doc2vec_model.accuracy(questions-words), or is it impossible?

A few notes:
That 'accuracy()' test is only a test of word-vectors on analogy problems – an easy evaluation to run, used in a number of papers, but not the final authority on whether a set of word-vectors is better than others for a particular purpose. (When I've had a project-specific scoring method, sometimes the word-vectors that score best on project-specific goals don't score best on those analogies – especially if the word-vectors are being used for a classification or information-retrieval task.)
Further, the popular and fast PV-DBOW Doc2Vec mode (dm=0 in gensim) doesn't train word-vectors at all, unless you add another setting (dbow_words=1). Such untrained word-vectors will be in random locations, scoring awfully on the analogies-accuracy.
But, using either PV-DM (dm=1) mode, or adding dbow_words=1 to PV-DBOW, will get word-vectors from Doc2Vec, and you might still want to run the analogies test. Fortunately, analogy-evaluation options have been retained & even expanded on the KeyedVectors object that's held in the Doc2Vec wv property. You can call the old accuracy() method there:
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors.accuracy
But there's also a slightly-different scoring evaluate_word_pairs():
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.evaluate_word_pairs
(And in the 4.0.0 release there will be an evaluate_word_analogies() method which replaces accuracy().)
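For example, a minimal sketch of running those evaluations on the word-vectors held inside a trained Doc2Vec model (the model filename is a placeholder, and the evaluation files are the copies bundled with gensim's test data):
# Hedged sketch: evaluating the word-vectors in a Doc2Vec model's .wv property.
# Assumes the model was trained in a mode that learns word-vectors
# (dm=1, or dm=0 with dbow_words=1); "my_doc2vec.model" is a placeholder path.
from gensim.models.doc2vec import Doc2Vec
from gensim.test.utils import datapath

d2v_model = Doc2Vec.load("my_doc2vec.model")

# gensim 3.x: the old analogy test, now called on the KeyedVectors in .wv
sections = d2v_model.wv.accuracy(datapath("questions-words.txt"))

# gensim 4.x: accuracy() is replaced by evaluate_word_analogies()
# score, sections = d2v_model.wv.evaluate_word_analogies(datapath("questions-words.txt"))

# the slightly different word-pair similarity evaluation
pearson, spearman, oov_ratio = d2v_model.wv.evaluate_word_pairs(datapath("wordsim353.tsv"))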

Related

What does "fine-tuning of a BERT model" refer to?

I was not able to understand one thing: when people say "fine-tuning of BERT", what does it actually mean?
1. Are we retraining the entire model again with new data?
2. Or are we just training the top few transformer layers with new data?
3. Or are we training the entire model, but using the pre-trained weights as the initial weights?
4. Or are there already a few ANN layers on top of the transformer layers, which are the only part being trained, keeping the transformer weights frozen?
I tried Google, but I am getting confused; it would help if someone could clarify this for me.
Thanks in advance!
I remember reading about a Twitter poll with similar context, and it seems that most people tend to accept your suggestion 3. (or variants thereof) as the standard definition.
However, this obviously does not speak for every single work, but I think it's fairly safe to say that 1. is usually not included when talking about fine-tuning. Unless you have vast amounts of (labeled) task-specific data, this step would be referred to as pre-training a model.
2. and 4. could be considered fine-tuning as well, but from personal/anecdotal experience, allowing all parameters to change during fine-tuning has provided significantly better results. Depending on your use case, this is also fairly simple to experiment with, since freezing layers is trivial in libraries such as Huggingface transformers.
In either case, I would really consider them as variants of 3., since you're implicitly assuming that we start from pre-trained weights in these scenarios (correct me if I'm wrong).
Therefore, trying my best at a concise definition would be:
Fine-tuning refers to the step of training any number of parameters/layers with task-specific and labeled data, from a previous model checkpoint that has generally been trained on large amounts of text data with unsupervised MLM (masked language modeling).
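As a hedged illustration of the difference between variants 3. and 4. above (the model name and classification head below are illustrative assumptions, not something from the question):
# Sketch: freezing the pre-trained encoder (variant 4) vs. updating everything (variant 3).
# "bert-base-uncased" and the 2-label head are placeholder choices.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Variant 4: freeze the pre-trained transformer, so only the new classification head trains
for param in model.bert.parameters():
    param.requires_grad = False

# Variant 3 (the usual sense of "fine-tuning"): simply skip the loop above, so every
# parameter starts from the pre-trained checkpoint and is updated on the task data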

How to continue training Doc2Vec with a specific domain corpus after training with a generic corpus

I want to train a Doc2Vec model with a generic corpus and then continue training with a domain-specific corpus (I have read that this is a common strategy and I want to test the results).
I have all the documents, so I can build and tag the vocab at the beginning.
As I understand it, I should initially train all the epochs with the generic docs, and then repeat the epochs with the ad hoc docs. But this way, I cannot place all the docs in one corpus iterator and call train() once (as is recommended everywhere).
So, after building the global vocab, I have created two iterators, the first one for the generic docs and the second one for the ad hoc docs, and called train() twice.
Is this the best way, or is there a more appropriate one?
If it is the best way, how should I manage alpha and min_alpha? Is it a good decision not to specify them in the train() calls and let train() manage them?
Best
Alberto
This is probably not a wise strategy, because:
the Python Gensim Doc2Vec class hasn't ever properly supported expanding its known vocabulary after a 1st single build_vocab() call. (Up through at least 3.8.3, such attempts typically cause a Segmentation Fault process crash.) Thus if there are words that are only in your domain-corpus, an initial typical initialization/training on the generic-corpus would leave them out of the model entirely. (You could work around this, with some atypical extra steps, but the other concerns below would remain.)
if there is truly an important contrast between the words/word-senses used in your generic corpus and the different words/word-senses used in your domain corpus, the influence of the words from the generic corpus may not be beneficial, diluting domain-relevant meanings.
further, any followup training that just uses a subset of all documents (the domain corpus) will only be updating the vectors for that subset of words/word-senses, and the model's internal weights used for further unseen-document inference, in directions that make sense for the domain-corpus alone. Such later-trained vectors may be nudged arbitrarily far out of comparable alignment with other words not appearing in the domain-corpus, and earlier-trained vectors will find themselves no longer tuned in relation to the model's later-updated internal-weights. (Exactly how far will depend on the learning-rate alpha & epochs choices in the followup training, and how well that followup training optimizes model loss.)
If your domain dataset is sufficient, or can be grown with more domain data, it may not be necessary to mix in other training steps/data. But if you think you must try that, the best-grounded approach would be to shuffle all training data together, and train in one session where all words are known from the beginning, and all training examples are presented in balanced, interleaved fashion. (Or possibly, where some training texts considered extra-important are oversampled, but still mixed in with the variety of all available documents, in all epochs.)
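A minimal sketch of that single-session, combined-corpus approach (the corpus contents and parameters below are toy placeholders, and the parameter names assume gensim 4.x):
# Sketch: tag and shuffle generic + domain docs together, then train once,
# so all words are known from the beginning and all examples are interleaved.
import random
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

generic_docs = [["generic", "news", "text"], ["another", "generic", "document"]]
domain_docs = [["domain", "specific", "terminology"], ["more", "ad", "hoc", "jargon"]]

corpus = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(generic_docs + domain_docs)]
random.shuffle(corpus)

# min_count=1 only so this toy example trains; real corpora usually want a higher value
model = Doc2Vec(vector_size=300, min_count=1, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)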
If you see an authoritative source suggesting such a "train with one dataset, then another disjoint dataset" approach with the Doc2Vec algorithms, you should press them for more details on what they did to make that work: exact code steps, and the evaluations which showed an improvement. (It's not impossible that there's some way to manage all the issues! But I've seen many vague impressions that this separate-pretraining is straightforward or beneficial, and zero actual working writeups with code and evaluation metrics showing that it's working.)
Update with respect to the additional clarifications you provided at https://stackoverflow.com/a/64865886/130288:
Even with that context, my recommendation remains: don't do this segmenting of training into two batches. It's almost certain to degrade the model compared to a combined training.
I would be interested to see links to the "references in the literature" you allude to. They may be confused or talking about algorithms other than the Doc2Vec ("Paragraph Vectors") algorithm.
If there is any reason to give your domain docs more weight, a better-grounded way would be to oversample them in the combined corpus.
But by all means, test all these variants & publish the relative results. If you're exploring shaky hypotheses, I would ignore any advice from StackOverflow-like sources & just run all the variants that your reading of the literature suggests, to see which, if any, actually help.
You're right to recognize that the choice of alpha parameters is a murky area that could majorly influence what impact such add-on training has. There's no right answer, so you'll have to search-for and reason-out what might make sense. The inherent issues I've mentioned with such subset-followup-training could make it so that even if you find benefits in some combos, they may be more a product of a lucky combination of data & arbitrary parameters than a generalizable practice.
And: your specific question "if it is better to set such values or not provide them at all" reduces to: "do you want to use the default values, or values set when the model was created, or not?"
Which values might be workable, if at all, for this unproven technique is something that'd need to be experimentally discovered. That is, if you wanted to have comparable (or publishable) results here, I think you'd have to justify from your own novel work some specific strategy for choosing good alpha/epochs and other parameters, rather than adopt any practice merely recommended in a StackOverflow answer.

human-interpretable, meaningful clusters using doc2vec

I am clustering a set of education documents using doc2vec.
As a human, I think of these as in categories such as:
computer-related
language related
collaboration
arts
etc.
I wonder if there is a way to 'guide' the doc2vec clustering into a set of clusters that are human-interpretable.
One strategy I have been trying is to filter out all 'nonsense' words, and only train doc2vec on the words that seem meaningful. But of course, this seems to perhaps ruin the training.
Something just occurred to me that might work:
Train on entire documents (don't filter out words) to create doc2vec space
Filter nonsense words ('help', 'student', etc. are words that have very little meaning in this space) out of each document
Project filtered documents into doc2vec space
Then process using k-means, etc.
I would appreciate any constructive suggestions or next steps.
best
Your plan is fine; you should try it to evaluate the results. The clusters may not map tightly to your preconceived groupings, but by looking at the example docs per cluster, you'll probably be able to form your own rough idea of what the cluster "is" in human-crafted descriptive terms.
Don't try too much guesswork preprocessing (like eliminating words) at first. Try those kinds of variations after you have the simplest possible approach working, as a baseline – so you can evaluate (even if only by ad hoc eyeballing) whether they're helping as expected. (For example, if a word like 'student' truly appears across all documents equally, it won't have much influence either way on Doc2Vec final doc coordinates... so you don't have to make that judgement call yourself, it'll just be deemphasized automatically.)
I'm assuming that by Doc2Vec you mean the 'Paragraph Vector' algorithm, as implemented by the Doc2Vec class in Python gensim. Some PV-Doc2Vec modes, including the default PV-DM (dm=1) and also the simpler PV-DBOW if you also enable concurrent word-training (dm=0, dbow_words=1), train word-vectors into the same space as doc-vectors. So the word-vectors that are closest to the doc-vectors in a cluster, or the cluster's centroid, might be useful as interpretable descriptions of the cluster.
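For example, a hedged sketch of that idea, labeling k-means clusters of doc-vectors with the nearest word-vectors (the model path, cluster count, and gensim-4.x attribute names are assumptions):
# Sketch: describe each KMeans cluster by the words whose vectors sit closest
# to the cluster's centroid; only works if word-vectors were co-trained.
from sklearn.cluster import KMeans
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("my_doc2vec.model")   # placeholder path
doc_vectors = model.dv.vectors             # model.docvecs.vectors_docs in gensim 3.x

kmeans = KMeans(n_clusters=5, random_state=0).fit(doc_vectors)
for cluster_id, centroid in enumerate(kmeans.cluster_centers_):
    nearest = model.wv.similar_by_vector(centroid, topn=10)
    print(cluster_id, [word for word, _ in nearest])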
(In the word-vector space, there's also research that tries to make the individual dimensions of word-vectors more interpretable by constraining training in some way, such as requiring vectors to be sparse with only non-negative dimensions. See for example this NNSE work and other papers like it. Presumably that might also be applicable to doc-vectors, but I don't know offhand any papers or libraries to do that.)
You could also apply other topic-modeling algorithms, like LDA, that calculate discrete 'topics' that are usually fairly interpretable, and report the strongest topics in each document. (You can cluster on the full doc-topics weights, or perhaps just naively assign each document to its one strongest topic as a simple kind of clustering.)
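A small sketch of that LDA alternative, using gensim's LdaModel on toy token lists (all names, documents, and counts below are illustrative placeholders):
# Sketch: discrete LDA topics, plus a naive "assign each doc to its strongest topic" clustering.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["computer", "coding", "lab"], ["painting", "music", "arts"],
        ["teamwork", "collaboration", "group"], ["grammar", "language", "writing"]]

dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(bow_corpus, num_topics=4, id2word=dictionary, passes=10)
for i, bow in enumerate(bow_corpus):
    topics = lda.get_document_topics(bow)
    strongest = max(topics, key=lambda t: t[1])   # (topic_id, weight) of the dominant topic
    print(i, strongest, lda.show_topic(strongest[0], topn=3))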

how to use build_vocab in gensim?

Does build_vocab() extend my old vocabulary?
For example, my understanding is that when I use Doc2Vec to train a model, it just builds the vocabulary from the datasets. If I want to extend it, I need to use build_vocab().
Where should I use it? Should I put it after "gensim.doc2vec()"?
For example:
sentences = gensim.models.doc2vec.TaggedLineDocument(f_path)
dm_model = gensim.models.doc2vec.Doc2Vec(sentences, dm=1, size=300, window=8, min_count=5, workers=4)
dm_model.build_vocab()
You should follow working examples in gensim documentation/tutorials/notebooks or online tutorials to understand which steps are necessary and in what order.
In particular, if you provide your sentences corpus iterable on the Doc2Vec() initialization, it will automatically do both the vocabulary-discovery pass and all training – so you don’t then need to call either build_vocab() or train() yourself. And further, you would never call build_vocab() with no arguments. (No working example in docs or online will do what your code does – so don’t improvise new things until you’ve followed the examples and know why they do what they do.)
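For reference, a sketch of that usual pattern (the file path is a placeholder, and the parameter names assume gensim 4.x, where size became vector_size):
# Sketch: supplying the corpus at construction time, so Doc2Vec() performs
# both the vocabulary-discovery pass and all training itself.
import gensim

sentences = gensim.models.doc2vec.TaggedLineDocument("my_corpus.txt")   # placeholder path
dm_model = gensim.models.doc2vec.Doc2Vec(
    sentences, dm=1, vector_size=300, window=8, min_count=5, workers=4)
# because the corpus was supplied here, no separate build_vocab() or train() call is needed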
There is an optional update argument to build_vocab(), which purports to allow the expansion of a vocabulary from an earlier training session (in preparation for further training with the newer words). HOWEVER, it’s only been developed/tested with regard to Word2Vec models – there are reports it causes crashes when used with Doc2Vec. And even in Word2Vec, its overall effects and best-ways-to-use aren’t clear, across all training modes. So I don’t recommend its use except for experts who can read & interpret the source code, and many involved tradeoffs, on their own. If you receive a chunk of new texts, with new words, the best-grounded course of action, and easiest to evaluate/reason-about, is to re-train from scratch, using a combined corpus of all text examples.

load Doc2Vec model and get new sentence's vectors for test

I have read lots of examples regarding doc2vec, but I couldn't find an answer. As a real example: I want to build a model with doc2vec and then train some ML models on its document vectors. After that, how can I get the vector of a raw string from the same trained Doc2Vec model? I need to predict with my ML model using a vector of the same size and meaning.
There are a collection of example Jupyter (aka IPython) notebooks in the gensim docs/notebooks directory. You can view them online at:
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
But they'll be in your gensim installation directory, if you can find that for your current working environment.
Those that include doc2vec in their name demonstrate the use of the Doc2Vec class. The most basic intro operates on the 'Lee' corpus that's bundled with gensim for use in its unit tests. (It's really too small for real Doc2Vec success, but by forcing smaller models and many training iterations the notebook just barely manages to get some consistent results.) See:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
It includes a section on inferring a vector for a new text:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
Note that inference is performed on a list of string tokens, not a raw string. And those tokens should have been preprocessed/tokenized the same way as the original training data for the model, so that the vocabularies are compatible. (Any unknown words in a new text are silently ignored.)
Note also that especially on short texts, it often helps to provide a much-larger-than-default value of the optional steps parameter to infer_vector() - say 50 or 200 rather than the default 5. It may also help to provide a starting alpha parameter more like the training default of 0.025 than the method-default of 0.1.
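Putting those notes together, a hedged sketch of inferring a vector for a new raw string (the model path is a placeholder, and simple_preprocess stands in for whatever tokenization was used on the training data):
# Sketch: tokenize the raw string the same way as training data, then infer_vector().
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

model = Doc2Vec.load("my_doc2vec.model")   # placeholder path

raw_text = "a new document the model has never seen"
tokens = simple_preprocess(raw_text)   # must match training-time preprocessing/tokenization

# larger steps & a smaller starting alpha than infer_vector()'s defaults
# (in gensim 4.x the steps parameter is named epochs)
vector = model.infer_vector(tokens, steps=200, alpha=0.025)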

Resources