Doc2Vec: Do we need to train the model with utils.shuffle?

I am new to NLP and Doc2Vec. I noticed that some websites train Doc2Vec by shuffling the training data in each epoch (option 1), while others use option 2, with no shuffling of the training data.
What is the difference? Also, how do I select the optimal alpha? Thank you.
### Option 1 ###
for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]),
                     total_examples=len(train_tagged.values), epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha
vs
### Option 2 ###
model_dbow.train(train_tagged.values, total_examples=len(train_tagged.values), epochs=30)

If your corpus might have some major difference-in-character between early & late documents – such as certain words/topics that are all front-loaded to early docs, or all back-loaded in later docs – then performing one shuffle up-front to eliminate any such pattern may help a little. It's not strictly necessary & its effects on end results will likely be small.
Re-shuffling between every training pass is not common & I wouldn't expect it to offer a detectable benefit justifying its cost/code-complexity.
Regarding your "Option 1" vs "Option 2": Don't call train() multiple times in your own loop unless you're an expert who knows exactly why you're doing that. (And: any online example suggesting that is often a poor/buggy one.)

Related

Having trouble training Word2Vec iteratively on Gensim

I'm attempting to train the model iteratively on multiple texts that I supply myself. However, I keep running into an issue when I train the model more than once:
ValueError: You must specify either total_examples or total_words, for proper learning-rate and progress calculations. If you've just built the vocabulary using the same corpus, using the count cached in the model is sufficient: total_examples=model.corpus_count.
I'm currently initiating my model like this:
model = Word2Vec(sentences, min_count=0, workers=cpu_count())
model.build_vocab(sentences, update=False)
model.save('firstmodel.model')
model = Word2Vec.load('firstmodel.model')
and subsequently training it iteratively like this:
model.build_vocab(sentences, update = True)
model.train(sentences, totalexamples=model.corpus_count, epochs=model.epochs)
What am I missing here?
Somehow, it worked when I just trained one other model, so not sure why it doesn't work beyond two models...
First, the error message says you need to supply either the total_examples or total_words parameter to train() (so that it has an accurate estimate of the total training-corpus size).
Your code, as currently shown, only supplies totalexamples – a parameter name missing the necessary _. Correcting this typo should remedy the immediate error.
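For reference, a minimal sketch of the corrected call, using the same variables as in the question:

model.build_vocab(sentences, update=True)
model.train(sentences,
            total_examples=model.corpus_count,  # note the underscore
            epochs=model.epochs)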
However, some other comments on your usage:
repeatedly calling train() with different data is an expert technique that is highly subject to error or other problems. It's not the usual way of using Word2Vec, nor the way most published results were reached. You can't count on it to always improve the model with new words; it might make the model worse, because new training sessions update some-but-not-all words and break the (usual) property that the vocabulary reflects one consistent set of word-frequencies from a single corpus. The best course is to call train() once, with all available data, so that the full vocabulary, word-frequencies, and equally-trained word-vectors come from a single consistent session (see the sketch after these points).
min_count=0 is almost always a bad idea with word2vec: words with few examples in the corpus should be discarded. Trying to learn word-vectors for them not only gets weak vectors for those words, but dilutes/distracts the model from achieving better vectors for surrounding more-common words.
a workers count up to your local cpu_count() only reliably helps up to about 4-12 workers, depending on other parameters and the efficiency of your corpus-reading; beyond that, more workers can hurt, due to inefficiencies in the Python GIL and Gensim's corpus-to-worker handoff. Finding the actual best count for your setup is, unfortunately, still a matter of trial and error. But if you've got 16 (or more) cores, your setting is almost sure to do worse than a lower workers number.
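A minimal sketch of that single-session pattern; the corpus names here are placeholders for all of your available texts, and the specific parameter values are illustrative:

from multiprocessing import cpu_count
from gensim.models import Word2Vec

# Combine everything you have before training, rather than training incrementally
all_sentences = list(earlier_sentences) + list(newer_sentences)

model = Word2Vec(all_sentences,
                 min_count=5,                   # discard rare words rather than min_count=0
                 workers=min(8, cpu_count()))   # more than ~8-12 workers rarely helps
model.save('combined.model')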

How to get negative word samples in Gensim Word2Vec Model?

I am using gensim Word2Vec model to train word embeddings. My code is:
w2v_model = Word2Vec(min_count=20,
                     window=2,
                     vector_size=50,
                     sample=6e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20,
                     workers=cores-1)
w2v_model.build_vocab(sentences, progress_per=10000)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=50, report_delay=1)
I wonder whether I can access the negative and positive word samples during the process?
Thanks in advance.
Deep inside the training loops, for each individual 'center' word in the training texts that is to be predicted – a micro-training-example for the shallow neural-net – a different set of negative words will be chosen.
Those negative-words will be used for just that one set of forward/backward neural-net nudges, then discarded when training moves to the next word.
There's no way to access them other than changing that core code – which is actually written in Cython, & re-compiled into a native library after any changes. (It's a bit harder to tinker with than pure Python code.)
You can see where the exact choice-of-negative samples happens in the source code for one of the modes (CBOW w/ negative-sampling) here:
https://github.com/RaRe-Technologies/gensim/blob/91175ddc7e3d6f3a2af245c20af21ec3bf5e360f/gensim/models/word2vec_inner.pyx#L427
If you just need a representative set of negative-words, you could replicate these steps in your own code.
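A hedged sketch of that idea, assuming a gensim 4.x Word2Vec model whose cum_table has been built by build_vocab() (that table holds the cumulative, frequency-smoothed counts the Cython loop samples from; attribute names may differ in older versions):

import numpy as np

def sample_negative_words(model, k=20, rng=None):
    # Approximate the training loop's draw: pick a random point under the
    # cumulative count table and map it back to a word index
    rng = rng or np.random.default_rng()
    draws = rng.integers(0, model.cum_table[-1], size=k)
    idxs = np.searchsorted(model.cum_table, draws, side='left')
    return [model.wv.index_to_key[int(i)] for i in idxs]

print(sample_negative_words(w2v_model, k=20))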
If you want to know (& potentially log?) the negative words chosen for every positive prediction, I suspect that's a misguided idea:
Meaningful analysis of this algorithm's behavior won't depend on individual micro-examples, nor on the arbitrarily-random negative words chosen over all of training. The interesting properties only arise from the tug-of-war happening across the interplay of all training.
As this is very deep in the training loops, even the most-efficient extra steps, as a function of the negative-words, would slow things down a lot. Or, in the case of logging, they would result in 20x (for negative=20) more logged negative-words than your original training corpus. For the kinds of large corpora where this algorithm works well, such a slowdown/log could be onerous; for tiny toy-sized examples, this algorithm won't be working interestingly at all.
So the mere question, if you truly want a peek at all the (random, arbitrary) negative words during the process, suggests you may be going down a questionable path.
It'd be easier for me to imagine just wanting to see a representative set of the negatively-sampled words - because any 10, or 10,000, or 1,000,000 such randomly-chosen words are as good as any other, and the algorithm (on adequately-sized data) is robust against usual variance in which negative-words are actually chosen. And for that, you could just run the same sampling-process outside the training.
Separately: those are odd non-default choices for alpha & min_alpha, values that usually don't need any tweaking, and any tweaks should really only be made with a conscious plan, driven by quantitative evaluations comparing the results of alternate values. But those specific odd, unmotivated values are pretty common in some of the worst online tutorials. So beware where you're learning about word2vec!

Gensim doc2vec file stream training worse performance

Recently I switched to gensim 3.6, and the main reason was the optimized training process, which streams the training data directly from a file, thus avoiding the GIL performance penalties.
This is how I used to train my doc2vec:
training_iterations = 20
d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
d2v.build_vocab(corpus)
for epoch in range(training_iterations):
    d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.iter)
    d2v.alpha -= 0.0002
    d2v.min_alpha = d2v.alpha
And it is classifying documents quite well; the only drawback is that the CPUs are only utilized at about 70% while it trains.
So the new way:
corpus_fname = "spped.data"
save_as_line_sentence(corpus, corpus_fname)
# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = cpu_count()
# Train models using all cores
d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores, dm=0, vector_size=200, epochs=50)
Now all CPUs are utilized at 100%, but the model is performing very poorly.
According to the documentation, I should not call the train() method myself either; I should only pass an epochs count rather than managing iterations, and the alpha and min_alpha values should not be touched.
The configuration of both Doc2Vec models looks the same to me, so is there an issue with my new setup or configuration, or is there something wrong with the new version of gensim?
P.S. I am using the same corpus in both cases. I also tried an epoch count of 100, as well as smaller numbers like 5-20, but I had no luck.
EDIT: The first model was doing 20 iterations of 5 epochs each, while the second was doing 50 epochs; having the second model do 100 epochs made it perform even better, since I was no longer managing the alpha myself.
About the second issue that popped up: when providing a file of line documents, the doc ids did not always correspond to the lines. I didn't manage to figure out what was causing this; it seems to work fine for a small corpus. If I find out what I am doing wrong, I will update this answer.
The final configuration for a corpus of size 4GB looks like this:
d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
d2v.build_vocab(corpus)
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=100)
Most users should not be calling train() more than once in their own loop, where they try to manage the alpha & iterations themselves. It is too easy to do it wrong.
Specifically, your code where you call train() in a loop is doing it wrong. Whatever online source or tutorial you modeled this code on, you should stop consulting it, as it's misleading or outdated. (The notebooks bundled with gensim are better examples on which to base any code.)
Even more specifically: your looping code is actually doing 100 passes over the data: 20 outer loops, each calling train() for the default d2v.iter of 5 passes. Your first train() call smoothly decays the effective alpha from 0.025 to 0.00025, a 100x reduction. But then your next train() call uses a fixed alpha of 0.0248 for 5 passes, then 0.0246, and so on, until your last loop does 5 passes at alpha=0.0212, still about 85% of the starting value. That is, the lowest alpha is reached at the end of your very first train() call, after which the effective rate jumps back up and never decays that low again.
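A quick trace of that schedule (illustrative arithmetic only):

alpha, step = 0.025, 0.0002
for outer in range(20):
    detail = "decays from 0.0250 to 0.00025" if outer == 0 else f"is fixed at {alpha:.4f}"
    print(f"outer loop {outer}: 5 passes, alpha {detail}")
    alpha -= step   # mirrors d2v.alpha -= 0.0002; d2v.min_alpha = d2v.alpha
# final loop: alpha = 0.025 - 19 * 0.0002 = 0.0212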
Otherwise, call train() exactly the same way in both options; the only difference should be that one specifies corpus_file instead of an iterable corpus.
You should get similar results from both corpus forms. (If you had a reproducible test case where the same corpus gets very different-quality results, and there wasn't some other error, that could be worth reporting to gensim as a bug.)
If the results for both aren't as good as when you were managing train() and alpha wrongly, it would likely be because you aren't doing a comparable amount of total training.
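Under that advice, both set-ups reduce to something like the following sketch; the corpus, file name, and worker count are taken from the question, and epochs=100 matches the final configuration above:

from multiprocessing import cpu_count
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import save_as_line_sentence

num_cores = cpu_count()

# Iterable-corpus form: one build_vocab(), one train(), default alpha handling
d2v = Doc2Vec(vector_size=200, dm=0, epochs=100, workers=num_cores)
d2v.build_vocab(corpus)
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.epochs)

# corpus_file form: identical parameters, only the corpus is given as a file;
# the constructor builds the vocab and trains as soon as corpus_file is supplied
save_as_line_sentence(corpus, corpus_fname)
d2v_file = Doc2Vec(corpus_file=corpus_fname, vector_size=200, dm=0,
                   epochs=100, workers=num_cores)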

Classification training with only positive sentences

I'm starting a project to build an automated fact-checking classifier and I have some doubts about the process to follow.
I have a database of ~1000 sentences, each one being a positive fact-check example. In order to build a supervised machine-learning model I would need a big set of sentences tagged true/false depending on whether each is a fact-check candidate or not. That would require a lot of time and effort, so I'd like to first get results (with less accuracy, I guess) without doing that.
My idea is to use the already-tagged positive sentences and apply a PoS tagger to them. This would give me interesting information for spotting patterns, like the most common words (e.g. raised, increase, won) and the PoS tags (e.g. verbs in past/present tense, time expressions and numerals).
With these results I'm thinking about assigning weights in order to analyze new unclassified sentences. The problem is that the weight assignment would be done by me in a "heuristic" way. It would be better to use the results of the PoS tagger to train some model which assigns probabilities in a more sophisticated way.
Could you give me some pointers if there's a way to accomplish this?
I read about Maximum Entropy Classifiers and statistical parsers but I really don't know if they're the right choice.
Edit (I think it'd be better to give more details):
Parsing the sentences with a PoS tagger will give me some useful information about each one of them, allowing me to filter them and weight them using some custom metrics.
For example:
There are one million more people in poverty than five years ago -> indicators of a fact-check candidate sentence: verb in present tense, numerals and dates, a (than) comparison.
We will increase the GDP by 3% the following year -> indicators of a NOT fact-check candidate sentence: it's in the future tense (suggesting some sort of prediction).
This situation happens often when the true sentences are relatively rare in the data.
1) Get a corpus of sentences that resemble what you will be classifying in the end. The corpus will contain both true and false sentences. Label them all as false, or non-fact-check; we are assuming they are all false even though we know this is not the case. You want the ratio of true/false data to approximate the actual distribution if at all possible. So if 10% are true in real data, then your assumed-false cases should make up 90%, or 9,000 for your 1,000 trues. If you don't know the distribution, just make it 10x or more.
2) Train a logistic regression classifier (aka maximum entropy) on the data with cross-validation. Keep track of the high-scoring false positives on the held-out data (a rough sketch follows these steps).
3) Re-annotate the false positives down to whatever score makes sense as possibly being true positives. This will hopefully clean your assumed-false data.
4) Keep running this process until you are no longer improving the classifier.
5) To get your "fact check words", make sure your feature extractor is feeding words to your classifier, and look for those that are positively associated with the true category; any decent logistic regression classifier should expose the feature weights in some way. I use LingPipe, which certainly does.
6) I don't see how PoS (Part of Speech) helps with this problem.
This approach will fail to find true instances that are very different from the training data, but it can work nonetheless.
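A minimal sketch of steps 1-3 and 5 with scikit-learn (LingPipe, mentioned above, would work just as well; texts, labels, and the cut-off of 200 suspects are illustrative placeholders):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# texts: all sentences; labels: 1 for the ~1000 known fact-check sentences,
# 0 for everything else (assumed false, even though some may not be)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Held-out probabilities via cross-validation (step 2)
probs = cross_val_predict(clf, texts, labels, cv=5, method='predict_proba')[:, 1]

# High-scoring "false positives": assumed-false sentences the model rates as true;
# re-annotate these by hand and repeat until the classifier stops improving (steps 3-4)
suspects = [i for i in np.argsort(-probs) if labels[i] == 0][:200]

# After convergence, fit on everything and inspect the word weights (step 5)
clf.fit(texts, labels)
weights = clf.named_steps['logisticregression'].coef_[0]
vocab = clf.named_steps['tfidfvectorizer'].get_feature_names_out()
fact_check_words = [vocab[i] for i in np.argsort(-weights)[:20]]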
Breck

Sklearn overfitting

I have a data set containing 1000 points, each with 2 inputs and 1 output. It has been split 80% for training and 20% for testing. I am training with sklearn's support vector regressor. I get 100% accuracy on the training set, but the results on the test set are not good. I think it may be because of overfitting. Can you please suggest something to solve the problem?
You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with different settings. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initialized with several input parameters, each of which could be set to a number of different values. For simplicity, let's assume you only want to tweak two parameters, 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
from sklearn.svm import SVR

for kernel in ('rbf', 'linear'):
    for c in (0.1, 1, 10):
        svr = SVR(kernel=kernel, C=c, degree=4)
        svr.fit(train_features, train_target)
        score = svr.score(test_features, test_target)
        print(kernel, c, score)
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
clf = GridSearchCV(SVR(degree=4), parameters)
clf.fit(train_features, train_target)
print(clf.best_score_)
print(clf.best_params_)
model = clf.best_estimator_  # This is your model
I am working on a little tool to simplify using sklearn for small projects, making it a matter of configuring a yaml file and letting the tool do all the work for you. It is available on my GitHub account; you might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
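A hedged sketch of that idea, reusing the variable names from the snippets above (the degree and kernel are just for illustration):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR

# Expand the two inputs into degree-2 polynomial features before the SVR
poly_svr = make_pipeline(PolynomialFeatures(degree=2), SVR(kernel='linear', C=1))
poly_svr.fit(train_features, train_target)
print(poly_svr.score(test_features, test_target))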
Try fitting your training data with sklearn's K-Fold cross-validation. This gives you a fairer use of the data and a better model, though at some computational cost, which shouldn't really matter for a small dataset where accuracy is the priority.
A few hints:
Since you have only two inputs, it would be helpful to plot your data. Try either a scatter plot with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by @shahins.
In particular, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it: a smaller C corresponds to stronger regularization.
If it's taking too long, you can also try RandomizedSearchCV
As a side note on @shahins' answer (I am not allowed to add comments): the two implementations are not equivalent. GridSearchCV is better because it performs cross-validation within the training set for tuning the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data
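One way to combine the last two hints, scaling and GridSearchCV, so that both scaling and tuning happen inside the training folds (variable names reused from the answers above):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

pipe = Pipeline([('scale', StandardScaler()), ('svr', SVR())])
params = {'svr__kernel': ('linear', 'rbf'), 'svr__C': (0.1, 1, 10)}

search = GridSearchCV(pipe, params, cv=5)   # hyperparameters tuned by CV on the training set only
search.fit(train_features, train_target)
print(search.best_params_, search.score(test_features, test_target))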
