Recently I switched to gensim 3.6 and the main reason was the optimized training process, which streams the training data directly from file, thus avoiding the GIL performance penalties.
This is how I used to trin my doc2vec:
training_iterations = 20
d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
d2v.build_vocab(corpus)
for epoch in range(training_iterations):
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.iter)
d2v.alpha -= 0.0002
d2v.min_alpha = d2v.alpha
And it is classifying documents quite well, only draw back is that when it is trained CPUs are utilized at 70%
So the new way:
corpus_fname = "spped.data"
save_as_line_sentence(corpus, corpus_fname)
# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = cpu_count()
# Train models using all cores
d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores, dm=0, vector_size=200, epochs=50)
Now all CPUs are utilized at 100%
but the model is performing very poorly.
According to the documentation, I should not use the train method also, I should use only epoch count and not iterations, also the min_aplpha and aplha values should not be touched.
The configuration of both Doc2Vec looks the same to me so is there an issue with my new set up or configuration, or there is something wrong with the new version of gensim?
P.S I am using the same corpus in both cases, also I tried epoch count = 100, also with smaller numbers like 5-20, but I had no luck
EDIT: First model was doing 20 iterations 5 epoch each, second was doing 50 epoch, so having the second model make 100 epochs made it perform even better, since I was no longer managing the alpha by myself.
About the second issue that popped up: when providing file with line documents, the doc ids were not always corresponding to the lines, I didn't manage to figure out what could be causing this, it seems to work fine for small corpus, If I find out what I am doing wrong I will update this answer.
The final configuration for corpus of size 4GB looks like this
d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
d2v.build_vocab(corpus)
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=100)
Most users should not be calling train() more than once in their own loop, where they try to manage the alpha & iterations themselves. It is too easy to do it wrong.
Specifically, your code where you call train() in a loop is doing it wrong. Whatever online source or tutorial you modeled this code on, you should stop consulting, as it's misleading or outdated. (The notebooks bundled with gensim are better examples on which to base any code.)
Even more specifically: your looping code is actually doing 100 passes over the data, 20 of your outer loops, then the default d2v.iter 5 times each call to train(). And your first train() call is smoothly decaying the effective alpha from 0.025 to 0.00025, a 100x reduction. But then your next train() call uses a fixed alpha of 0.0248 for 5 passes. Then 0.0246, etc, until your last loop does 5 passes at alpha=0.0212 – not even 80% of the starting value. That is, the lowest alpha will have been reached early in your training.
Call the two options exactly the same except for the way the corpus_file is specified, instead of an iterable corpus.
You should get similar results from both corpus forms. (If you had a reproducible test case where the same corpus gets very different-quality results, and there wasn't some other error, that could be worth reporting to gensim as a bug.)
If the results for both aren't as good as when you were managing train() and alpha wrongly, it would likely be because you aren't doing a comparable amount of total training.
Related
I'm using gensim's Word2Vec for a recommendation-like task with part of my evaluation being the use of callbacks and the most_similar() method. However, I am noticing a huge disparity between the final few epoch callbacks and that of immediately post-training. In fact, the last epoch callback may often appear worthless, while the post training result is as best as could be desired.
My during-training tracking of most similar entries utilizes gensim's CallbackAny2Vec class. It follows the doc example fairly directly and roughly looks like:
class EpochTracker(CallbackAny2Vec):
def __init__(self):
self.epoch = 0
def on_epoch_begin(self, model):
print("Epoch #{} start".format(self.epoch))
def on_epoch_end(self, model):
print('Some diagnostics')
# Multiple terms used in the below
e = model.wv
print(e.most_similar(positive=['some term'])[0:3]) # grab the top 3 examples for some term
print("Epoch #{} end".format(self.epoch))
self.epoch += 1
As the epochs progress, the most_similar() results given by the callbacks to not seem to indicate an advancement of learning and seem erratic. In fact, often the callback from the first epoch shows the best result.
Counterintuitively, I also have an additional process (not shown) built into the callback that does indicate gradual learning. Following the similarity step, I take the current model's vectors and evaluate them against a down-stream task. In brief, this process is a sklearn GridSearchCV logistic regression check against some known labels.
I find that often the last on_epoch_end callback appears to be garbage. Or perhaps some multi-threading shenanigans. However, if directly after training the model I try the similarity call again:
e = e_model.wv # e_model was the variable assignment of the model overall
print(e.most_similar(positive=['some term'])[0:3])
I tend to get beautiful results that are in agreement with the downstream evaluation task also used in the callbacks, or are at least vastly different than that of the final epoch end.
I suspect I am missing something painfully apparent or most_similar() has an unusual behavior with epoch-end callbacks. Is this a known issue or is my approach flawed?
What version of Gensim are you using?
In older versions – pre 4.0 if I remember correctly? – the most_similar() operation relies on a cached pre-computed set of unit-normalized word-vectors that in some cases will be frozen when you first try a most_similar().
Thus, incremental updates to vectors won't be reflected in results, unless something happens to flush that cache - which happens at then end of training. But, since mid-training checks weren't an originally-envisioned usage, more-frequent flushing doesn't happen unless forced.
I think if you're sure to use the latest Gensim, the problem may go away - or reviewing this older project issue may provide ideas if you're stuck on an older version: https://github.com/RaRe-Technologies/gensim/issues/2260
(Your other mid-training learning process – if it's accessing the non-normalized per-word vectors directly, rather than via most_similar() – is likely succeeding because it's skipping that normed-vector cache.)
I'm attempting to train multiple texts supplied by myself iteratively. However, I keep running into an issue when I train the model more than once:
ValueError: You must specify either total_examples or total_words, for proper learning-rate and progress calculations. If you've just built the vocabulary using the same corpus, using the count cached in the model is sufficient: total_examples=model.corpus_count.
I'm currently initiating my model like this:
model = Word2Vec(sentences, min_count=0, workers=cpu_count())
model.build_vocab(sentences, update=False)
model.save('firstmodel.model')
model = Word2Vec.load('firstmodel.model')
and subsequently training it iteratively like this:
model.build_vocab(sentences, update = True)
model.train(sentences, totalexamples=model.corpus_count, epochs=model.epochs)
What am I missing here?
Somehow, it worked when I just trained one other model, so not sure why it doesn't work beyond two models...
First, the error message says you need to supply either the total_examples or total_words parameter to train() (so that it has an accurate estimate of the total training-corpus size).
Your code, as currently shown, only supplies totalexamples – a parameter name missing the necessary _. Correcting this typo should remedy the immediate error.
However, some other comments on your usage:
repeatedly calling train() with different data is an expert technique highly subject to error or other problems. It's not the usual way of using Word2Vec, nor the way most published results were reached. You can't count on it to always improve the model with new words; it might make the model worse, as new training sessions update some-but-not-all words, and alter the (usual) property that the vocabulary has one consistent set of word-frequencies from one single corpus. The best course is to train() once, with all available data, so that the full vocabulary, word-frequencies, & equally-trained word-vectors are achieved in a single consistent session.
min_count=0 is almost always a bad idea with word2vec: words with few examples in the corpus should be discarded. Trying to learn word-vectors for them not only gets weak vectors for those words, but dilutes/distracts the model from achieving better vectors for surrounding more-common words.
a count of workers up to your local cpu_count() only reliably helps up to about 4-12 workers, depending on other parameters & the efficiency of your corpus-reading, then more workers can hurt, due to inefficiencies in the Python GIL & Gensim corpus-to-worker handoffs. (inding the actual best count for your setup is, unfortunately, still just a matter of trial and error. But if you've got 16 (or more) cores, your setting is almost sure to do worse than a lower workers number.
This is maybe a more general question, but I cannot find information about this anywhere.
There are a lot of tutorials how to train your model in DDP, and that seems to work for me fine. However, once the training is done, how do you do the evaluation?
When train on 2 nodes with 4 GPUs each, and have dist.destroy_process_group() after training, the evaluation is still done 8 times, with 8 different results. In my training loop I save the model using torch.save(model.state_dict(),'ClassifierModel.pt'), only on rank 0. Then for the testing loop I load the model using model.load_state_dict(torch.load('./ClassifierModel.pt', map_location={'cuda:%d' % 0: 'cuda:%d' % device})). Additionally, for the testing I set test_sampler=None.
If needed I could provide more details of my code but I only listed the things which I think might be relevant.
I use machine learning algorithm in Malware analysis. When I input some features, I get strange training time. For example:
4 feature(A,B,C,D), model training time is 3 seconds.
3 Features(A,B,C), training time is 5 seconds.
2 features(A, B), training time is 8 seconds.
1 feature(A), training time is 4 seconds.
This kind of result happens on both MLP and Random Forest. In my opinion, the training time should be faster if I use less features, but the result is complete different.
In KNN, the result will be like these:
If I using 6,5,4,3 features(A,B,C,D,E,F), model testing time is about 1.1 seconds, almost the same.
2 features(A,B), model testing time is 3 seconds.
1 feature (A), model testing time is 5 seconds.
My dataset has 17K records and using 10-Fold cross-validation. The feature is sort by their entropy, feature A have highest entropy and feature F is lowest. Using Google Colab with sklearn for the testing. I tried several times in different date, and the trend is the same.
The feature of my dataset has total 79 features, the appearance only happens with few features.
Thanks for anyone who reply me, I have no idea about it.
It does seem at first glance that having fewer features will result in lower training times. However, depending on which algorithm is being used, this may not be the case. In training, an objective function (loss function) is being minimized by the algorithm. Taking the case of the MLP neural network, if you change the features (especially depending on whether they're informative or not), you're changing the feature space (or "error surface") over which the optimization occurs and possibly the minima of the function will be harder to find, resulting in more steps and longer training in order to satisfy the convergence criteria.
I am trying to use the TensorFlow object detection API to recognize a specific object (guitars) in pictures and videos.
As for the data, I downloaded the images from the OpenImage dataset, and derived the .tfrecord files. I am testing with different numbers, but for now let's say I have 200 images in the training set and 100 in the evaluation one.
I'm traininig the model using the "ssd_mobilenet_v1_coco" as a starting point, and the "model_main.py" script, so that I can have training and validation results.
When I visualize the training progress in TensorBoard, I get the following results for train:
and validation loss:
respectively.
I am generally new to computer vision and trying to learn, so I was trying to figure out the meaning of these plots.
The training loss goes as expected, decreasing over time.
In my (probably simplistic) view, I was expecting the validation loss to start at high values, decrease as training goes on, and then start increasing again if the training goes on for too long and the model starts overfitting.
But in my case, I don't see this behavior for the validation curve, which seems to be trending upwards basically all the time (excluding fluctuations).
Have I been training the model for too little time to see the behavior I'm expecting? Are my expectations wrong in the first place? Am I misinterpreting the curves?
Ok, I fixed it by decreasing the initial_learning_rate from 0.004 to 0.0001.
It was the obvious solution, considering the wild oscillations of the validation loss, but at first I thought it wouldn't work since there seems to be already a learning rate scheduler in the config file.
However, immediately below (in the config file) there's a num_steps option, and it's stated that
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the pets dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
Honestly, I don't remember if I commented out the num_steps option...if I didn't, it seems my learning rate was kept to the initial value of 0.004, which turned out to be too high.
If I did comment it out (so that the learning scheduler was active), I guess that, instead of the decrease, it still started from too high of a value.
Anyway, it's working much better now, I hope this can be useful if anyone is experiencing the same problem.