According to the MALLET documentation, it's possible to train topic models incrementally:
"-output-model [FILENAME] This option specifies a file to write a
serialized MALLET topic trainer object. This type of output is
appropriate for pausing and restarting training"
I'd like to train topics on one set of data and then update the model with a different set of data. After both training steps, I'd like to output states for both datasets (with --output-state). Here is how I'm trying to do it:
# training on the first dataset
../mallet-2.0.7/bin/mallet import-dir --input input/ --keep-sequence --output input.mallet
../mallet-2.0.7/bin/mallet train-topics --input input.mallet --num-topics 3 --output-state topic-state.gz --output-model model
# training on the second dataset
../mallet-2.0.7/bin/mallet import-dir --input input2/ --keep-sequence --output input2.mallet --use-pipe-from input.mallet
../mallet-2.0.7/bin/mallet train-topics --input input2.mallet --num-topics 3 --num-iterations 100 --output-state topic-state2.gz --input-model model
In the last command, if I add " --input-model model", the data from the 2nd dataset is not present in the output-state file. If I don't add it, the data from the 1st dataset is not present in the output-state file.
If I try to add additional instances to a model in the code:
model.addInstances(instances);
model.setNumThreads(2);
model.setNumIterations(50);
model.estimate();
[...]
model.addInstances(instances2);
model.setNumThreads(2);
model.setNumIterations(50);
model.estimate();
I get an error:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 30
at cc.mallet.topics.ParallelTopicModel.buildInitialTypeTopicCounts(ParallelTopicModel.java:364)
at cc.mallet.topics.ParallelTopicModel.addInstances(ParallelTopicModel.java:276)
at cc.mallet.examples.TopicModel2.main(TopicModel2.java:66)
There have been similar questions on the MALLET list before: http://permalink.gmane.org/gmane.comp.ai.mallet.devel/924, http://permalink.gmane.org/gmane.comp.ai.mallet.devel/2139
So is incremental training of topic models possible?
I think you were part of this conversation thread, which may be useful for you now:
http://comments.gmane.org/gmane.comp.ai.mallet.devel/2153
I'm using the pytorch-forecasting library (which is based on pytorch-lightning) for running a TFT model on time series forecasting. My training routine is split into three phases: first I perform HPO using Optuna, then training+validation, and finally retraining on the full data (no validation).
Currently, both training+validation and retraining start from fresh models trained from scratch, so the runtime is quite high. To reduce the runtime of the whole routine, I'm trying to leverage incremental training: load the checkpointed model from phase 2 and retrain it for a few more epochs in phase 3.
I have a method fit_model() which is used in both training/validation and retraining, but with different args. The core part of fit_model() looks something like the following:
def fit_model(self, **kwargs):
    ...
    to_retrain = kwargs.get('to_retrain', False)
    ckpt_path = kwargs.get('ckpt_path', None)
    trainer = self._get_trainer(cluster_id, gpu_id, to_retrain)  # returns a pl.Trainer object
    tft_lightning_module = self._prepare_for_training(cluster_id, to_retrain)
    train_dtloaders = ...
    val_dtloaders = ...
    if not to_retrain:
        trainer.fit(
            tft_lightning_module,
            train_dataloaders=train_dtloaders,
            val_dataloaders=val_dtloaders
        )
    else:
        trainer.fit(
            tft_lightning_module,
            train_dataloaders=train_dtloaders,
            val_dataloaders=val_dtloaders,
            ckpt_path=ckpt_path
        )
    best_model_path = trainer.checkpoint_callback.best_model_path
    return best_model_path
When I call the above method in my retraining phase, I can see the log line saying that the checkpointed model is being loaded:
Restored all states from the checkpoint file at /tft/incremental_training/tft_training_20230206/171049/lightning_logs_3/lightning_logs/version_0/checkpoints/epoch=4-step=5.ckpt
But unfortunately, no further training happens in phase 3. If I look at the best_model_path returned by the method, it still points to the old checkpoint from the training/validation phase, not to one from the retraining phase. How can I resolve this issue?
I'm using the following libraries:
pytorch-lightning==1.6.5
pytorch-forecasting==0.9.0
I finally got it working. The key thing to keep in mind here is that max_epochs is cumulative across training and retraining: if we train for x epochs and intend to retrain for y more epochs, then max_epochs has to be set to x+y, not y, during retraining.
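For illustration, here is a minimal, self-contained sketch of that rule. It is not the asker's code; the toy module, the dataloader, and the epoch counts (5 epochs, then 3 more) are made-up placeholders:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

train_dl = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 1)), batch_size=8)

# Phase 2: initial training for 5 epochs.
module = ToyModule()
trainer = pl.Trainer(max_epochs=5, default_root_dir="phase2")
trainer.fit(module, train_dataloaders=train_dl)
ckpt = trainer.checkpoint_callback.best_model_path

# Phase 3: retrain for 3 *more* epochs. The restored checkpoint already carries
# current_epoch == 5, so max_epochs must be the cumulative total 5 + 3 = 8;
# with max_epochs=3 the Trainer would consider training finished and stop immediately.
retrainer = pl.Trainer(max_epochs=8, default_root_dir="phase3")
retrainer.fit(module, train_dataloaders=train_dl, ckpt_path=ckpt)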
I am working on a deep learning model that uses a large amount of time-series-related data. As the data is too big to be loaded into RAM at once, I will use Keras's train_on_batch to train the model, reading data from disk.
I am looking for a simple and fast process to split the data among train, validation and test folders.
I've tried the "splitfolders" function, but could not deactivate the data shuffling (which is inappropriate for time-series-related data). The arguments in this function's documentation do not include an option to turn shuffling on/off.
Code I've tried:
import splitfolders
input_folder = r"E:\Doutorado\apagar"
splitfolders.ratio(input_folder, output=r'E:\Doutorado\apagardivididos',
                   ratio=(0.7, 0.2, 0.1), group_prefix=None)
The resulting split data is shuffled, and this shuffle is a problem for my time-series analysis...
source: https://pypi.org/project/split-folders/
splitfolders.ratio("input_folder", output="output",
seed=1337, ratio=(.8, .1, .1), group_prefix=None, move=False) # default values
Usage:
splitfolders [--output] [--ratio] [--fixed] [--seed] [--oversample] [--group_prefix] [--move] folder_with_images
Options:
--output path to the output folder. defaults to output. Get created if non-existent.
--ratio the ratio to split. e.g. for train/val/test .8 .1 .1 -- or for train/val .8 .2 --.
--fixed set the absolute number of items per validation/test set. The remaining items constitute
the training set. e.g. for train/val/test 100 100 or for train/val 100.
Set 3 values, e.g. 300 100 100, to limit the number of training values.
--seed set seed value for shuffling the items. defaults to 1337.
--oversample enable oversampling of imbalanced datasets, works only with --fixed.
--group_prefix split files into equally-sized groups based on their prefix
--move move the files instead of copying
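Since splitfolders always shuffles before splitting, one workaround for time-series data is a small manual, order-preserving split. The sketch below is only an illustration, not part of the split-folders package: it assumes each class lives in its own subfolder, that filenames sort chronologically, and the 70/20/10 ratios simply mirror the question:

import shutil
from pathlib import Path

def chronological_split(input_folder, output_folder, ratios=(0.7, 0.2, 0.1)):
    input_folder = Path(input_folder)
    output_folder = Path(output_folder)
    for class_dir in sorted(p for p in input_folder.iterdir() if p.is_dir()):
        files = sorted(class_dir.iterdir())  # chronological if filenames encode time
        n = len(files)
        train_end = int(n * ratios[0])
        val_end = train_end + int(n * ratios[1])
        splits = {"train": files[:train_end],
                  "val": files[train_end:val_end],
                  "test": files[val_end:]}
        for split_name, split_files in splits.items():
            dest = output_folder / split_name / class_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            for f in split_files:
                shutil.copy2(f, dest / f.name)

# chronological_split(r"E:\Doutorado\apagar", r"E:\Doutorado\apagardivididos")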
I've been training some models from the TensorFlow 2 object detection model zoo, following the tutorial at https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html.
One of my doubts arose when I wanted to export my model from a checkpoint that had the total loss I was looking for. However, I found that I had 16 checkpoints (just the last five of them) while only 1500 steps had elapsed. I should mention that I passed a flag to save a checkpoint every 100 steps. So I'm wondering whether:
It creates an initial checkpoint, let's say checkpoint 0, so if I want to export a model from the 1400th step I should take the 15th checkpoint,
or
It creates a "placeholder" for the future checkpoint, i.e. if the training process is currently at step 1500 it prepares to store the next checkpoint, and I should then take just the 14th checkpoint.
I leave an image for reference.
In this example, I have 16 checkpoints but only 1500 steps have elapsed. I've chosen to save a new checkpoint every 100 steps. If I want to export a new model from step 1400, should I export the 14th or the 15th?
Any help would be much appreciated.
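Not an authoritative answer, but one way to verify which training step a given checkpoint actually corresponds to, instead of inferring it from the file index, is to read the checkpoint and look for a step/iteration counter. The checkpoint path below is a placeholder, and the exact variable name depends on the model and optimizer, so the snippet simply scans for likely candidates:

import tensorflow as tf

reader = tf.train.load_checkpoint("training/ckpt-14")  # placeholder checkpoint prefix
for name in reader.get_variable_to_shape_map():
    if "step" in name.lower() or "iter" in name.lower():
        print(name, reader.get_tensor(name))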
I'm training my classifier on 20k images, but every week I get new pictures, so I want to incrementally train my previous model (from the last stopped iteration) instead of retraining on all 20k + new images again, which is a waste of time and compute.
I figured out incremental training with YOLO, but can't seem to find anything for the MobileNet-SSD Caffe implementation here: https://github.com/chuanqi305/MobileNet-SSD
To understand more about what I'm talking about, refer to this question: How to do incremental training on the basis of yolov3.weights, whose answer mentions:
darknet.exe partial cfg/yolov3.cfg yolov3.weights yolov3.conv.105 105
You need to pass the previous iteration's weights in train.sh instead of the 73000-iteration model. The new iterations are found in the snapshot folder once you are done training.
if ! test -f example/MobileNetSSD_train.prototxt ;then
    echo "error: example/MobileNetSSD_train.prototxt does not exist."
    echo "please use the gen_model.sh to generate your own model."
    exit 1
fi
mkdir -p snapshot
# Initiate a new training.
# For incremental training, point -weights at the latest .caffemodel from the
# snapshot/ folder instead of mobilenet_iter_73000.caffemodel.
$CAFFE_ROOT/build/tools/caffe train -solver="solver_train.prototxt" \
-weights="mobilenet_iter_73000.caffemodel" \
-gpu 0
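As a small illustrative helper (not part of the original train.sh, and it assumes Caffe's default snapshot naming of <prefix>_iter_<N>.caffemodel), this picks the newest snapshot so its path can be passed to -weights instead of mobilenet_iter_73000.caffemodel:

from pathlib import Path

# Assumes snapshot files are named like mobilenet_iter_75000.caffemodel
snapshots = sorted(Path("snapshot").glob("*_iter_*.caffemodel"),
                   key=lambda p: int(p.stem.rsplit("_iter_", 1)[-1]))
if snapshots:
    print(snapshots[-1])  # pass this path to `caffe train -weights ...`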
I am using Tensorflow's Object Detection API to detect cars. It should detect the cars as one class "car".
I followed sentdex's series:
https://pythonprogramming.net/introduction-use-tensorflow-object-detection-api-tutorial/
System information:
OS - Ubuntu 18.04 LTS
GPU - Nvidia 940M (VRAM : 2GB)
Tensorflow : 1.10
Python - 3.6
CPU - Intel i5
Problem:
The training process runs pretty fine. To know when the model converges and when I should stop training, I observe the per-step loss in the terminal where the training is running, and I also watch the Total Loss graph in TensorBoard by running the following command in another terminal:
$tensorboard --logdir="training"
But even after training up to 60k steps, the loss fluctuates between 2.1 and 1.2. If I stop the training and export the inference graph from the last checkpoint (saved in the training/ folder), it detects cars in some cases and gives false positives in others.
I also tried running eval.py as below:
python3 eval.py --logtostderr --pipeline_config_path=training/ssd_mobilenet_v1_pets.config --checkpoint_dir=training/ --eval_dir=eval/
but it gives an error indicating that the GPU does not have enough memory to run this script alongside train.py.
So I stop the training to make sure the GPU is free and then run eval.py, but it creates only one eval point in the eval/ folder. Why?
Also, how do I tell from the Precision graphs in TensorBoard that the training should be stopped?
I could also post screenshots if anyone wants.
Should I keep training till the loss stays on an average around 1?
Thanks.
PS: Added the Total Loss graph below, up to 66k steps.
PS2: After 2 days of training (and still running), this is the total loss graph below.
Usually, one uses a separate set of data to measure the error and generalisation abilities of the model. So, one would have the following sets of data to train and evaluate a model:
Training set: The data used to train the model.
Validation set: A separate set of data which will be used to measure the error during training. The data of this set is not used to perform any weight updates.
Test set: This set is used to measure the model's performance after the training.
In your case, you would have to define a separate set of data, the validation set, and run an evaluation repeatedly after a fixed number of batches/steps, logging the error or accuracy. What usually happens is that the error on that data decreases in the beginning and increases at a certain point during training. So it's important to keep track of that error and to generate a checkpoint whenever this error decreases. The checkpoint with the lowest error on your validation data is the one you want to use. This technique is called Early Stopping.
The reason why the error increases after a certain point during training is called overfitting. It tells you that the model loses its ability to generalize to unseen data.
Edit:
Here's an example of a training loop with an early stopping procedure:
for step in range(1, _MAX_ITER):
    if step % _TEST_ITER == 0:
        sample_count = 0
        while True:
            try:
                test_data = sess.run(test_batch)
                test_loss, summary = self._model.loss(sess, test_data[0], self._assign_target(test_data), self._merged_summary)
                sess.run(self._increment_loss_opt, feed_dict={self._current_loss_pl: test_loss})
                sample_count += 1
            except tf.errors.OutOfRangeError:
                # Validation data exhausted: average the loss over all batches seen.
                score = sess.run(self._avg_batch_loss, feed_dict={self._batch_count_pl: sample_count})
                best_score = sess.run(self._best_loss)
                if score < best_score:
                    '''
                    Save your model here...
                    '''
                break