How to continue training Doc2Vec with a specific domain corpus after training with a generic corpus - doc2vec

I want to train a Doc2Vec model with a generic corpus and, then, continue training with a domain-specific corpus (I have read that is a common strategy and I want to test results).
I have all the documents, so I can build and tag the vocab at the beginning.
As I understand, I should train initially all the epochs with the generic docs, and then repeat the epochs with the ad hoc docs. But, this way, I cannot place all the docs in a corpus iterator and call train() once (as it is recommended everywhere).
So, after building the global vocab, I have created two iterators, the first one for the generic docs and the second one for the ad hoc docs, and called train() twice.
Is it the best way or it is a more appropriate way?
If the best, how I should manage alpha and min_alpha? Is it a good decision not to mention them in the train() calls and let the train() manage them?
Best
Alberto

This is probably not a wise strategy, because:
the Python Gensim Doc2Vec class hasn't ever properly supported expanding its known vocabulary after a 1st single build_vocab() call. (Up through at least 3.8.3, such attempts typically cause a Segmentation Fault process crash.) Thus if there are words that are only in your domain-corpus, an initial typical initialization/training on the generic-corpus would leave them out of the model entirely. (You could work around this, with some atypical extra steps, but the other concerns below would remain.)
if there is truly an important contrast between the words/word-senses used in your generic and the different words/word-senses used in your domain corpus, influence of the words from the generic corpus may not be beneficial, diluting domain-relevant meanings
further, any followup training that just uses a subset of all documents (the domain corpus) will only be updating the vectors for that subset of words/word-senses, and the model's internal weights used for further unseen-document inference, in directions that make sense for the domain-corpus alone. Such later-trained vectors may be nudged arbitrarily far out of comparable alignment with other words not appearing in the domain-corpus, and earlier-trained vectors will find themselves no longer tuned in relation to the model's later-updated internal-weights. (Exactly how far will depend on the learning-rate alpha & epochs choices in the followup training, and how well that followup training optimizes model loss.)
If your domain dataset is sufficient, or can be grown with more domain data, it may not be necessary to mix in other training steps/data. But if you think you must try that, the best-grounded approach would be to shuffle all training data together, and train in one session where all words are known from the beginning, and all training examples are presented in balanced, interleaved fashion. (Or possibly, where some training texts considered extra-important are oversampled, but still mixed in with the variety of all available documents, in all epochs.)
If you see an authoritative source suggesting such a "train with one dataset, then another disjoint dataset" approach with the Doc2Vec algorithms, you should press them for more details on what they did to make that work: exact code steps, and the evaluations which showed an improvement. (It's not impossible that there's some way to manage all the issues! But I've seen many vague impressions that this separate-pretraining is straightforward or beneficial, and zero actual working writeups with code and evaluation metrics showing that it's working.)
Update with respect to the additional clarifications you provided at https://stackoverflow.com/a/64865886/130288:
Even with that context, my recommendation remains: don't do this segmenting of training into two batches. It's almost certain to degrade the model compared to a combined training.
I would be interested to see links to the "references in the literature" you allude to. They may be confused or talking about algorithms other than the Doc2Vec ("Paragraph Vectors") algorithm.
If there is any reason to give your domain docs more weight, a better-grounded way would be to oversample them in the combined corpus.
Bu by all means, test all these variants & publish the relative results. If you're exploring shaky hypotheses, I would ignore any advice from StackOverflow-like sources & just run all the variants that your reading of the literature suggest, to see which, if any actually help.
You're right to recognized that the choice of alpha parameters is a murky area that could majorly influence what impact such add-on training has. There's no right answer, so you'll have to search-for and reason-out what might make sense. The inherent issues I've mentioned with such subset-followup-training could make it so that even if you find benefits in some combos, they may be more a product of a lucky combination of data & arbitrary parameters than a generalizable practice.
And: your specific question "if it is better to set such values or not provide them at all" reduces to: "do you want to use the default values, or values set when the model was created, or not?"
Which values might be workable, if at all, for this unproven technique is something that'd need to be experimentally discovered. That is, if you wanted to have comparable (or publishable) results here, I think you'd have to justify from your own novel work some specific strategy for choosing good alpha/epochs and other parameters, rather than adopt any practice merely recommended in a StackOverflow answer.

Related

Find list of Out Of Vocabulary (OOV) words from my domain spectific pdf while using FastText model

How to find list of Out Of Vocabulary (OOV) words from my domain spectific pdf while using FastText model? I need to fine tune FastText with my domain specific words.
A FastText model will already be able to generate vectors for OOV words.
So there's not necessarily any need to either list the specifically OOV words in your PDF, nor 'fine tune' as FastText model.
You just ask it for vectors, it gives them back. The vectors for full in-vocabulary words, that were trained from relevant training material, will likely be best, while vectors synthesized for OOV words from word-fragments (character n-grams) shared with training material will just be rough guesses - better than nothing, but not great.
(To train a good word-vector requires many varied examples of a word's use, interleaved with similarly good examples of its many 'peer' words – and traditionally, in one unified, balanced training session.)
If you think you need to do more, you should expand your questin with more details about why you think that's necessary, and what existing precedents (in docs/tutorials/papers) you're trying to match.
I've not seen a well-documented way to casually fine-tune, or incrementally expand the known-vocabulary of, an existing FastText model. There would be a lot of expert tradeoffs required, and in many cases simply training a new model with sufficient data is likely to be a safer approach.
Anyone seeking such fine-tuning should have a clear idea of:
what their incremental data might be able to add to an existing model
what process/code will they be using, and why that process/code might be expected to give meaningful results with their specific starting model & new data
how the results of any such process can be evaluated to ensure the extra fine-tuning steps are beneficial compared to alternatives

Emphasis on a feature while training a vanilla nn

I have some 360 odd features on which I am training my neural network model.
The accuracy I am getting is abysmally bad. There is one feature amongst the 360 that is more important than the others.
Right now, it does not enjoy any special status amongst the other features.
Is there a way to lay emphasis on one of the features while training the model? I believe this could improve my model's accuracy.
I am using Python 3.5 with Keras and Scikit-learn.
EDIT: I am attempting a regression problem
Any help would be appreciated
First of all, I would make sure that this feature alone has a decent prediction probability, but I am assuming that you already made sure of it.
Then, one approach that you could take, is to "embed" your 359 other features in a first layer, and only feed in your special feature once you have compressed the remaining information.
Contrary to what most tutorials make you believe, you do not have to add in all features already in the first layer, but can technically insert them at any point in time (or even multiple times).
The first layer that captures your other inputs is then some form of "PCA approximator", where you are embedding a high-dimensional feature space (359 dimensions) into something that is less dominant over your other feature (maybe 20-50 dimensions as a starting point?)
Of course there is no guarantee that this will work, but you might have a much better chance of getting attention on your special feature, although I am fairly certain that in general you should still see an increase in performance if the single feature is strongly enough correlated with your output.
The other question that is still open is the kind of task you are training for, i.e., whether you are doing some form of classification (if so, how many classes?), or regression. This might also influence architectural choices, and the amount of focus you can/should put on a single feature.
There are several feature selection and importance techniques in machine learning. Please follow this link.

Can we compare word vectors from different models using transfer learning?

I want to train two word2vec/GLoVe models on different corpora and then compare the vectors of a single word. I know that it makes no sense to do so as different models start at different random states, but what if we use pre-trained word vectors as the starting point. Can we assume that the two models will continue to build upon the pre-trained vectors by incorporating the respective domain-specific knowledge, and not go into completely different states?
Tried to find some research papers which discuss this problem, but couldn't find any.
Simply starting your models with pre-trained bectors would eliminate some of the randomness, but with each training epoch on your new corpora:
there's still randomness introduced by negative-sampling (if using that default mode), by frequent-word downsampling (if using default values of the sample parameter in word2vec), and by the interplay of different threads
each epoch with your new corpora will be pulling the word-vectors for present words to new, better positions for that corpora, but leaving original words unmoved. The net movements over many epochs could move words arbitrarily far from where they started, in response to the whole-corpus-effects on all words.
So, doing so wouldn't necessarily achieve your goal in a reliable (or theoretically-defensible) way, though it might kinda-work – at least better than starting from purely random initialization – especially if your corpora are small and you do few training epochs. (That's usually a bad idea – you want big varied training data and enough passes for extra passes to make little incremental difference. But doing those things "wrong" could make your results look "better" in this scenario, where you don't want your training to change the original coordinate-space "too much". I wouldn't rely on such an approach.)
Especially if the words you need to compare are a small subset of the total vocabulary, a couple things you could consider:
combine the corpora into one training corpus, shuffled together, but for those words you need to compare, replace them with corpora-specific tokens. For example, replace 'sugar' with 'sugar_c1' and 'sugar_c2' – leaving the vast majority of surrounding words to be the same tokens (and thus learn a single vector across the whole corpus). Then, the two variant tokens for the "same word" will learn different vectors, based on their differing contexts that still share many of the same tokens.
using some "anchor set" of words that you know (or confidently conjecture) either do mean the same across both contexts, or should mean the same, train two models but learn a transformation between the two space based on those guide words. Then, when you apply that transformation to other words, that weren't used to learn the transformation, they'll land in contrasting positions in each others' spaces, maybe achieving the comparison you need. This is a technique that's been used for language-to-language translation, and there's a helper class and example notebook included with the Python gensim library.
There may be other better approaches, these are just two quick ideas that might work without much change to existing libraries. A project like 'HistWords', which used word-vector training to try to track evolving changes in word-meaning over time, might also have ideas for usable techniques.

human-interpretable, meaningful clusters using doc2vec

I am clustering a set of education documents using doc2vec.
As a human, I think of these as in categories such as:
computer-related
language related
collaboration
arts
etc.
I wonder if there is a way to 'guide' the doc2vec clustering into a set of clusters that are human-interpretable.
One strategy I have been trying is to filter out all 'nonsense' words, and only train doc2vec on the words that seem meaningful. But of course, this seems to perhaps ruin the training.
Something just occurred to me that might work:
Train on entire documents (don't filter out words) to create doc2vec space
Filter nonsense words ('help', 'student', etc. are words that have very little meaning in this space) out of each document
Project filtered documents into doc2vec space
then process using k-means etc
I would appreciate any constructive suggestions or next steps.
best
Your plan is fine; you should try it to evaluate the results. The clusters may not map tightly to your preconceived groupings, but by looking at the example docs per cluster, you'll probably be able to form your own rough idea of what the cluster "is" in human-crafted descriptive terms.
Don't try too much guesswork preprocessing (like eliminating words) at first. Try those kinds of variations after you have the simplest possible approach working, as a baseline – so you can evaluate (even if only by ad hoc eyeballing) whether they're helping as expected. (For example, if a word like 'student' truly appears across all documents equally, it won't have much influence either way on Doc2Vec final doc coordinates... so you don't have to make that judgement call yourself, it'll just be deemphasized automatically.)
I'm assuming that by Doc2Vec you mean the 'Paragraph Vector' algorithm, as implemented by the Doc2Vec class in Python gensim. Some PV-Doc2Vec modes, including the default PV-DM (dm=1) and also the simpler PV-DBOW if you also enable concurrent word-training (dm=0, dbow_words=1), train word-vectors into the same space as doc-vectors. So the word-vectors that are closest to the doc-vectors in a cluster, or the cluster's centroid, might be useful as interpretable descriptions of the cluster.
(In the word-vector space, there's also research that tries to make the individual dimensions of word-vectors more-interpretable by constraining training in some way, such as requiring vectors to be spares with only non-negative dimensions. See for example this NNSE work and other papers like it. Presumably that might also be applicable to doc-vectors, but I don't know offhand any papers or libraries to do that.)
You could also apply other topic-modeling algorithms, like LDA, that calculate discrete 'topics' that are usually fairly interpretable, and report the strongest topics in each document. (You can cluster on the full doc-topics weights, or perhaps just naively assign each document to its one strongest topic as a simple kind of clustering.)

how to use build_vocab in gensim?

Build_vocab extend my old vocabulary?
For example, my idea is when I use doc2vec(s) to train a model, it just builds the vocabulary from the datasets. If I want to extend it, I need to use build_vocab()
Where should I use it? Should I put it after "gensim.doc2vec()"?
For example:
sentences = gensim.models.doc2vec.TaggedLineDocument(f_path)
dm_model = gensim.models.doc2vec.Doc2Vec(sentences, dm=1, size=300, window=8, min_count=5, workers=4)
dm_model.build_vocab()
You should follow working examples in gensim documentation/tutorials/notebooks or online tutorials to understand which steps are necessary and in what order.
In particular, if you provide your sentences corpus iterable on the Doc2Vec() initialization, it will automatically do both the vocabulary-discovery pass and all training – so you don’t then need to call either build_vocab() or train() yourself. And further, you would never call build_vocab() with no arguments. (No working example in docs or online will do what your code does – so don’t improvise new things until you’ve followed the examples and know why they do what they do.)
There is an optional update argument to build_vocab(), which purports to allow the expansion of a vocabulary from an earlier training session (in preparation for further training with the newer words). HOWEVER, it’s only been developed/tested with regard to Word2Vec models – there are reports it causes crashes when used with Doc2Vec. And even in Word2Vec, its overall effects and best-ways-to-use aren’t clear, across all training modes. So I don’t recommend its use except for experts who can read & interpret the source code, and many involved tradeoffs, on their own. If you receive a chunk of new texts, with new words, the best-grounded course of action, and easiest to evaluate/reason-about, is to re-train from scratch, using a combined corpus of all text examples.

Resources