I currently have my trainer set up as:
training_args = TrainingArguments(
    output_dir=f"./results_{model_checkpoint}",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0.01,
    push_to_hub=True,
    save_total_limit=1,
    resume_from_checkpoint=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_qa["train"],
    eval_dataset=tokenized_qa["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)
After training, in my output_dir I have several files that the trainer saved:
['README.md',
'tokenizer.json',
'training_args.bin',
'.git',
'.gitignore',
'vocab.txt',
'config.json',
'checkpoint-5000',
'pytorch_model.bin',
'tokenizer_config.json',
'special_tokens_map.json',
'.gitattributes']
From the documentation it seems that resume_from_checkpoint will continue training the model from the last checkpoint:
resume_from_checkpoint (str or bool, optional) — If a str, local path to a saved checkpoint as saved by a previous instance of Trainer. If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.
But when I call trainer.train() it seems to delete the last checkpoint and start a new one:
Saving model checkpoint to ./results_distilbert-base-uncased/checkpoint-500
...
Deleting older checkpoint [results_distilbert-base-uncased/checkpoint-5000] due to args.save_total_limit
Does it really continue training from the last checkpoint (i.e., 5000) and just start the count of the new checkpoints at 0 (saving the first one after 500 steps as "checkpoint-500"), or does it simply not continue training at all? I haven't found a way to test it, and the documentation is not clear on this.
Looking at the code, it first loads the checkpoint state, updates how many epochs have already been run, and continues training from there to the total number of epochs you're running the job for (no reset to 0).
To see it continue training, increase your num_train_epochs before calling trainer.train() on your checkpoint.
You should also pass the resume_from_checkpoint parameter to trainer.train(), pointing it at the checkpoint:
trainer.train(resume_from_checkpoint="<path-where-checkpoints-were-stored>/checkpoint-0000")
(0000 is a placeholder for the checkpoint number.)
Don't forget to keep your drive mounted during this whole process.
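Putting those two points together, a minimal sketch based on the setup from the question (the raised epoch count and the exact checkpoint path are only for illustration):
# Raise the target so there is training left to do beyond the original 2 epochs.
training_args.num_train_epochs = 4

# Resume from a specific checkpoint directory ...
trainer.train(resume_from_checkpoint=f"./results_{model_checkpoint}/checkpoint-5000")

# ... or pass True to pick up the most recent checkpoint found in output_dir.
# trainer.train(resume_from_checkpoint=True)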
Related
I want to train a model in two stages. The first one is a pre-training with teacher forcing, and the second one is a regular training (without teacher forcing). The difference here is that the model is instantiated with use_teacher_forcing=True in the first case and use_teacher_forcing=False in the latter.
To do so, I currently run two trainings, where the second training resumes from the first training's checkpoint by passing the last checkpoint to the Lightning trainer.
Regarding the learning rate, I want to decay it over several milestones, both during pre-training and during regular training. For instance, if I use 5 epochs of pre-training and 5 epochs of regular training, I want the learning rate to be as follows:
Epoch: 0     1     2     3     4     5     6     7     8     9
LR:    1e-4  1e-4  1e-5  1e-5  1e-6  1e-4  1e-4  1e-5  1e-5  1e-6
However, I cannot find a way to reset the learning rate to its initial value at the beginning of the regular training, since the scheduler is also loaded from the checkpoint.
Is there a way to do this?
I am using torch 1.9.0 and pytorch-lightning 1.3.8 and am not able to upgrade to later versions.
I came across the following solution.
Apparently, it's not that hard to implement and use a custom learning rate scheduler. I'll leave the code here in case anybody stumbles upon the same problem.
import warnings
from collections import Counter

from torch.optim.lr_scheduler import _LRScheduler


class MultiStepLRWithReset(_LRScheduler):
    def __init__(self, optimizer, milestones, reset_epochs, reset_lr_to=None, gamma=0.1, last_epoch=-1, verbose=False):
        # Copy so the caller's milestone list is not mutated while it is extended.
        full_milestones = list(milestones)
        for reset_epoch in reset_epochs:
            full_milestones += [m + reset_epoch for m in milestones]
        self.milestones = Counter(full_milestones)
        self.reset_epochs = reset_epochs
        self.reset_lr_to = reset_lr_to
        self.gamma = gamma
        super(MultiStepLRWithReset, self).__init__(optimizer, last_epoch, verbose)

    def get_lr(self):
        if not self._get_lr_called_within_step:
            warnings.warn("To get the last learning rate computed by the scheduler, "
                          "please use `get_last_lr()`.", UserWarning)
        # At a reset epoch, jump back to the initial (or explicitly given) learning rate.
        if self.last_epoch in self.reset_epochs:
            if self.reset_lr_to is None:
                return [group['initial_lr'] for group in self.optimizer.param_groups]
            else:
                return [self.reset_lr_to for _ in self.optimizer.param_groups]
        # Between milestones, keep the current learning rate.
        if self.last_epoch not in self.milestones:
            return [group['lr'] for group in self.optimizer.param_groups]
        # At a milestone, decay by gamma (once per occurrence of the milestone).
        return [group['lr'] * self.gamma ** self.milestones[self.last_epoch]
                for group in self.optimizer.param_groups]
You will have to create one LRScheduler covering the entire training, because it will not be re-instantiated for the second training stage if all the PyTorch training components are loaded from their last checkpoint.
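For the 5 + 5 epoch schedule from the question, a usage sketch might look like the following (the Adam optimizer and the model are assumptions; milestones=[2, 4] decay the rate at epochs 2 and 4, and reset_epochs=[5] restores the initial rate when the regular training stage begins):
import torch

# Epochs 0-1: 1e-4, 2-3: 1e-5, 4: 1e-6, reset at epoch 5, then the same decay again.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = MultiStepLRWithReset(optimizer, milestones=[2, 4], reset_epochs=[5], gamma=0.1)
# Step this scheduler once per epoch (e.g. interval='epoch' in Lightning).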
I trained a Mask R-CNN model with the matterport implementation, with one class to detect. It worked.
Now I want to predict some unseen images. I know that the object is present on each image and that it appears only once per image. How do I use my model to do so?
A possibility that came to my mind was:
num_results = 0
while num_results == 0:
    model = mrcnn.model.MaskRCNN(mode='inference', config=pred_config)
    model.load_weights('weight/path')
    results = model.detect([img], verbose=1)
    num_results = compute_num_of(results)
    # lower DETECTION_MIN_CONFIDENCE from pred_config
But I think this is very time consuming because I load that model and its weights at every step. What would be best practice here?
Thanks
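One possible alternative to the retry loop, sketched under the assumption that the standard matterport result format ('rois', 'scores', ...) is used and that model_dir and the weight path are placeholders: build the model a single time with a very low detection threshold, call detect() once, and keep only the highest-scoring box.
import numpy as np
import mrcnn.model

# Accept every candidate detection and filter afterwards, instead of rebuilding the
# model with a progressively lower DETECTION_MIN_CONFIDENCE.
pred_config.DETECTION_MIN_CONFIDENCE = 0.0
model = mrcnn.model.MaskRCNN(mode='inference', config=pred_config, model_dir='logs')
model.load_weights('weight/path', by_name=True)

result = model.detect([img], verbose=1)[0]
if len(result['scores']) > 0:
    best = int(np.argmax(result['scores']))   # most confident detection
    best_box = result['rois'][best]           # (y1, x1, y2, x2) of the single object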
I am trying to construct a TFX pipeline with a Trainer component whose Keras model is trained in a run_fn like this:
def run_fn(fn_args: components.FnArgs):
    transform_output = TFTransformOutput(fn_args.transform_output)
    train_dataset = input_fn(fn_args.train_files,
                             fn_args.data_accessor,
                             transform_output,
                             num_batch)
    eval_dataset = input_fn(fn_args.eval_files,
                            fn_args.data_accessor,
                            transform_output,
                            num_batch)
    history = model.fit(train_dataset,
                        epochs=num_epochs,
                        steps_per_epoch=fn_args.train_steps,
                        validation_data=eval_dataset,
                        validation_steps=fn_args.eval_steps)
This works. However, if I change fitting to the following, this doesn't work:
    history = model.fit(train_dataset,
                        epochs=num_epochs,
                        batch_size=num_batch,
                        validation_split=0.1)
Now, I have two questions:
Why does fitting only work with steps_per_epoch? I couldn't find any explicit statement supporting this, but it is the only way that works. I conclude that it must be something TFX specific (does TFX hand the input data to Keras only in a generator-like way?).
Let's say my train_dataset contains 100 instances and steps_per_epoch=1000 (with epochs=1). Does that mean that my 100 input instances are each fed 10x in order to reach the defined 1000 steps? Isn't that counter-productive from a training perspective?
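For context, a hedged sketch of what a typical TFX input_fn looks like (the label_key and the exact factory options are assumptions): the DataAccessor yields an infinitely repeating, batched tf.data.Dataset, so Keras needs steps_per_epoch/validation_steps to know where an epoch ends, while validation_split, which requires an in-memory array it can slice, does not apply.
from tfx_bsl.public import tfxio

def input_fn(file_pattern, data_accessor, tf_transform_output, batch_size):
    # Returns a repeating, batched tf.data.Dataset of transformed features.
    return data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(batch_size=batch_size, label_key='label'),
        tf_transform_output.transformed_metadata.schema,
    ).repeat()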
I am working on face detection and recognition, where I want to detect faces in real time. But when it comes to training, it takes a very long time to train on the data. Is it possible to reduce the training time? Can anyone help me out with this problem?
import math
import os
import pickle

import face_recognition
from face_recognition.face_recognition_cli import image_files_in_folder
from sklearn import neighbors
from tqdm import tqdm


def train(train_dir, model_save_path=None, n_neighbors=None, knn_algo='ball_tree', verbose=False):
    X = []
    y = []
    # Loop through each person in the training set
    for class_dir in tqdm(os.listdir(train_dir)):
        if not os.path.isdir(os.path.join(train_dir, class_dir)):
            continue
        # Loop through each training image for the current person
        for img_path in image_files_in_folder(os.path.join(train_dir, class_dir)):
            image = face_recognition.load_image_file(img_path)
            face_bounding_boxes = face_recognition.face_locations(image)
            if len(face_bounding_boxes) != 1:
                # If there are no people (or too many people) in a training image, skip the image.
                if verbose:
                    print("Image {} not suitable for training: {}".format(
                        img_path,
                        "Didn't find a face" if len(face_bounding_boxes) < 1 else "Found more than one face"))
            else:
                # Add face encoding for current image to the training set
                X.append(face_recognition.face_encodings(image, known_face_locations=face_bounding_boxes)[0])
                y.append(class_dir.split('_')[0])
    # Determine how many neighbors to use for weighting in the KNN classifier
    if n_neighbors is None:
        n_neighbors = int(round(math.sqrt(len(X))))
        if verbose:
            print("Chose n_neighbors automatically:", n_neighbors)
    # Create and train the KNN classifier
    knn_clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, algorithm=knn_algo, weights='distance')
    print(knn_clf)
    knn_clf.fit(X, y)
    # Save the trained KNN classifier
    if model_save_path is not None:
        with open(model_save_path, 'wb') as f:
            pickle.dump(knn_clf, f)
    return knn_clf
This is the final call:
def trainer():
    # STEP 1: Train the KNN classifier and save it to disk
    # Once the model is trained and saved, you can skip this step next time.
    print("Training KNN classifier...")
    classifier = train("app/facerec/dataset", model_save_path="app/facerec/models/trained_model.clf", n_neighbors=3)
    print("Training complete!")
I also want to know: instead of rewriting the 'trained_model.clf' file every time, is there any way to update the existing file instead?
Training a kNN model shouldn't impose high runtime overhead. After all, the straightforward ("exact search") model is lazy: it stores the vectors and performs brute-force search at query (or classification) time.
I suspect the embedding computations dominate your training time.
As mentioned by @johncasey, you might want to use approximate-kNN models (or similarity search engines). There are many open-source similarity search libraries. Yet, if you need a production-ready, robust, real-time, efficient solution, then you should check out pinecone.io. (Disclaimer, I work for Pinecone.)
The k-NN algorithm has O(n) query time complexity. I recommend you use an approximate nearest neighbor (a-NN) algorithm instead; its time complexity is much lower. For example, Google image search is based on this kind of algorithm.
Spotify Annoy, Facebook Faiss, and NMSLIB are a-NN libraries.
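A minimal sketch of that idea, assuming the 128-dimensional face_recognition encodings are already collected in X and y as in the question, and that new_encoding is the encoding of a query face (the index filename and tree count are arbitrary):
from annoy import AnnoyIndex

# Index the known face encodings once; face_recognition encodings are 128-dimensional.
index = AnnoyIndex(128, 'euclidean')
for i, encoding in enumerate(X):
    index.add_item(i, encoding)
index.build(10)                 # more trees -> better recall, slower build
index.save('face_index.ann')    # persisted index; rebuild when new faces are added

# Query: labels of the 3 approximate nearest neighbours of a new encoding.
neighbor_ids = index.get_nns_by_vector(new_encoding, 3)
neighbor_labels = [y[i] for i in neighbor_ids]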
When I try to save the hyperopt.trials object, which contains information about automatic parameter tuning for a neural network,
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,  # or rand.suggest for random params selection
            max_evals=max_trials,
            trials=trials)  # , rstate=np.random.RandomState(50)
pickle.dump(trials, open("neuro.hyperopt", "wb"))
it gives the error:
can't pickle _thread.RLock objects
Moreover, it writes a 10 GB file to my local drive; that is, it saves not only the trials object but the whole model.
Could you help me save the trials object with a smaller size (e.g., the XGBoost trials file is only 1 MB) and avoid the error?
Thank you.
In my case it was because the models stored in the trials were not picklable.
I had tried to save a tf.keras.optimizers.Adam(learning_rate=0.001) object.
When I stored the string 'Adam' instead, the error disappeared.
Of course, that creates another problem: how to set the learning rate for the optimizer. But that seems to be easier to solve. One way is to replace the Keras objects with strings in the trials.trials object before saving:
for trial in trials.trials:
    if 'result' in trial.keys():
        trial['result'].pop('model', None)  # https://stackoverflow.com/questions/15411107/delete-a-dictionary-item-if-the-key-exists

# proceed with pickling
pickle.dump(trials, open("trials.pkl", "wb"))
(I took it from here)