I'm using tensorboard on pytorch: https://pytorch.org/docs/stable/tensorboard.html
I save images every epoch with add_image: https://pytorch.org/docs/stable/tensorboard.html#torch.utils.tensorboard.writer.SummaryWriter.add_image
I'm trying to find a way to keep only the last epoch of images because otherwise the tensorboard file becomes huge.
Is there a way to do it? Also, if I want to keep the best epoch j as well, is there a way to do that too?
A similar question was asked here: Remove image outputs from TensorBoard, but it isn't framed the same way, and it is ~6 years old.
I am trying to use the TensorFlow object detection API to recognize a specific object (guitars) in pictures and videos.
As for the data, I downloaded the images from the OpenImage dataset, and derived the .tfrecord files. I am testing with different numbers, but for now let's say I have 200 images in the training set and 100 in the evaluation one.
I'm training the model using "ssd_mobilenet_v1_coco" as a starting point, and the "model_main.py" script, so that I can have training and validation results.
When I visualize the training progress in TensorBoard, I get the following results for the training loss and the validation loss, respectively:
[Plots: training loss; validation loss]
I am generally new to computer vision and trying to learn, so I was trying to figure out the meaning of these plots.
The training loss goes as expected, decreasing over time.
In my (probably simplistic) view, I was expecting the validation loss to start at high values, decrease as training goes on, and then start increasing again if the training goes on for too long and the model starts overfitting.
But in my case, I don't see this behavior for the validation curve, which seems to be trending upwards basically all the time (excluding fluctuations).
Have I been training the model for too little time to see the behavior I'm expecting? Are my expectations wrong in the first place? Am I misinterpreting the curves?
Ok, I fixed it by decreasing the initial_learning_rate from 0.004 to 0.0001.
It was the obvious solution, considering the wild oscillations of the validation loss, but at first I thought it wouldn't work since there seems to be already a learning rate scheduler in the config file.
However, immediately below (in the config file) there's a num_steps option, and it's stated that
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the pets dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
Honestly, I don't remember whether I commented out the num_steps option. If I didn't, it seems my learning rate was kept at the initial value of 0.004, which turned out to be too high.
If I did comment it out (so that the learning rate scheduler was active), I guess that, even with the decay, it still started from too high a value.
Anyway, it's working much better now, I hope this can be useful if anyone is experiencing the same problem.
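The effect described in the config comment can be sketched in a few lines. This is only an illustration, assuming a staircase exponential decay schedule; the decay_steps and decay_factor values below are assumptions for the sake of the example and are not taken from the post. Only the initial learning rate (0.004) and the 200K step cap come from the text above.

```python
def staircase_exponential_decay(step, initial_lr, decay_steps, decay_factor):
    """Learning rate at `step` under a staircase exponential decay schedule."""
    # The rate only drops once every `decay_steps` steps.
    return initial_lr * decay_factor ** (step // decay_steps)


INITIAL_LR = 0.004      # value mentioned in the post
NUM_STEPS = 200_000     # training cap mentioned in the config comment
DECAY_STEPS = 800_720   # assumed for illustration, not taken from the post
DECAY_FACTOR = 0.95     # assumed for illustration, not taken from the post

# Training stops before the first decay boundary, so the rate never changes.
lr_at_end = staircase_exponential_decay(NUM_STEPS, INITIAL_LR, DECAY_STEPS, DECAY_FACTOR)
# Had training continued past the boundary, the rate would have dropped.
lr_after_boundary = staircase_exponential_decay(DECAY_STEPS, INITIAL_LR, DECAY_STEPS, DECAY_FACTOR)

print(lr_at_end, lr_after_boundary)
```

Under these assumptions, the whole run uses the initial rate, which is consistent with lowering initial_learning_rate being what made the difference.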
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have searched this forum for similar questions, but the one I found was unanswered (Updating Tensorflow Object detection model with new images). I have managed to train my own custom model (let's call it model1). I was wondering whether I can use new images that have been processed by model1 to further train model1. Will it improve the accuracy of the model?
Accuracy will depend on the number of correctly classified images and not only on the total number of training images. https://developers.google.com/machine-learning/crash-course/classification/accuracy. If you consider that the new images are to be used for training (have correct labels), then you should consider re-training the model. Take a look at this post https://datascience.stackexchange.com/questions/12761/should-a-model-be-re-trained-if-new-observations-are-available
You can use your current model (model1) in a number of ways:
on new images to detect bad results (hard examples) for new training
on new images to detect good results for evaluation
on the images in the existing dataset to detect bad images (wrong label etc.)
Some of the bad results from new images will be non-objects (adversarial) and not directly usable for training (but see this: https://github.com/tensorflow/models/issues/3578#issuecomment-375267920).
Removal of bad images from the existing dataset requires retraining from scratch unless there is some funky way of "untraining" images from a model.
Eventually one would end up approaching a perfect dataset that makes best use of the capacity of the chosen model architecture, although the domain may evolve over time.
I think the reason this is not much discussed is because most researchers have to work with common datasets so they can compare their approaches (brilliant read: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5697567/).
It might improve it, but it is tricky: it could lead to overfitting. Improving the dataset would actually help, but not with images detected by the model itself. Such images are detected because the model already performs well on them, so they are not much help.
What you actually need is quite the opposite. You need to teach the model to recognize the images that it didn't recognize before.
The main problem of machine learning (which is the approach you are using for object detection here) is that of generalization. In your case, it is the ability to recognize objects of the same type as those in the images you used for training, in images that were not used during training.
Obviously, if you were able to use all the possible images during training, your system would be perfect (actually, it would be a simple exact image matching problem). In a more realistic setup, the more training images you use, the higher the chance of obtaining a better object detector.
Usually, however, it is more valuable to add hard examples to your training set. Hence, if your application allows it (in terms of computation time in particular), you can indeed add all the images that are wrongly detected to your dataset (with the correct label), and it will probably help to get a better model, able to detect the object under harder conditions in new images.
However, it really depends on what you are doing. If you want to compare your system to another one, you need to use the same (training and) test images to be fair. For benchmarking, you are not allowed to include test images in the training dataset! When you compute the accuracy (on a validation/test dataset) to compare several settings, be sure you are fair in this comparison.
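The idea of picking out wrongly detected images as hard examples can be sketched as follows. This is a minimal illustration, not a definitive procedure: it assumes you have correct hand-made labels for the new images, that boxes are (x_min, y_min, x_max, y_max) tuples, and that an IoU threshold of 0.5 counts as a match. The file names and boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hard_example(true_boxes, predicted_boxes, iou_threshold=0.5):
    """True if a labelled object was missed or a detection matches no label."""
    missed = any(
        all(iou(t, p) < iou_threshold for p in predicted_boxes) for t in true_boxes
    )
    spurious = any(
        all(iou(p, t) < iou_threshold for t in true_boxes) for p in predicted_boxes
    )
    return missed or spurious


# Hypothetical hand-labelled boxes and model1 detections for three new images.
labelled = {
    "img_a.jpg": [(10, 10, 50, 50)],
    "img_b.jpg": [(0, 0, 20, 20)],
    "img_c.jpg": [],
}
detected = {
    "img_a.jpg": [(12, 11, 49, 52)],  # close to the label: not a hard example
    "img_b.jpg": [],                  # object missed
    "img_c.jpg": [(5, 5, 30, 30)],    # detection where there is no object
}

hard_examples = [
    name for name in labelled if is_hard_example(labelled[name], detected[name])
]
print(hard_examples)
```

Only the images selected this way (with their correct labels) would be added to the training set; images reserved for testing or benchmarking should stay out of it.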
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
Working with sklearn, the fit function of MLPClassifier is a nice one-size-fits-all solution; you call it once, and it trains until it hits the maximum number of iterations, or the training loss plateaus, all without any interaction. However, I had to change my code to accommodate some other features, and the standard fit function isn't configurable enough for what I want to do. I reconfigured my code to use partial_fit instead, manually running each iteration one at a time, but I can't figure out how to get my code to recognize when the loss plateaus, like the fit function does. I can't seem to find any properties or methods of MLPClassifier that allow me to access the loss value calculated by partial_fit, so that I can judge whether the loss has plateaued. It seems to me the only way to judge loss across each iteration would be to calculate it myself, despite the fact that partial_fit already calculates it, and even prints it to the console in verbose mode.
Edit: Running partial_fit manually still does cause the training algorithm to recognize when training loss stops improving; it prints the message "Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping." after each iteration, once the training loss plateaus. However, because I'm controlling the iterations manually, it doesn't actually stop, and I have no way of figuring out in my code whether or not this message has been printed in order to stop it manually.
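I would recommend manually logging the loss in a list:
from sklearn.neural_network import MLPClassifier

loss_list = []
clf = MLPClassifier()
# n_epochs, X, y and classes are placeholders for your own values
for epoch in range(n_epochs):
    clf.partial_fit(X, y, classes=classes)
    print(clf.loss_)  # loss computed by the most recent partial_fit call
    loss_list.append(clf.loss_)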
I can provide you with a stopping criterion if this code is helpful.
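One possible stopping criterion, sketched here under assumptions: it follows the spirit of the "did not improve more than tol for 10 consecutive epochs" message, but it is not taken from sklearn's source, and the class name PlateauStopper and its parameters are hypothetical. The simulated losses stand in for the values of clf.loss_ you would read after each partial_fit call.

```python
class PlateauStopper:
    """Signal a stop when the loss has not improved by more than `tol`
    for `patience` consecutive updates."""

    def __init__(self, tol=1e-4, patience=10):
        self.tol = tol
        self.patience = patience
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def update(self, loss):
        """Record one loss value; return True when training should stop."""
        if loss > self.best_loss - self.tol:
            self.stale_epochs += 1   # no meaningful improvement this epoch
        else:
            self.stale_epochs = 0    # improved by more than tol: reset
        if loss < self.best_loss:
            self.best_loss = loss
        return self.stale_epochs >= self.patience


# Stand-in for clf.loss_ values collected after each partial_fit call.
simulated_losses = [1.0, 0.5, 0.3] + [0.3] * 20

stopper = PlateauStopper(tol=1e-4, patience=10)
stopped_at = None
for epoch, loss in enumerate(simulated_losses):
    if stopper.update(loss):
        stopped_at = epoch
        break

print(stopped_at)
```

In a real training loop you would call stopper.update(clf.loss_) right after partial_fit and break out of the loop when it returns True.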
I am using PyTorch on Windows 10, and I'm having trouble understanding the correct use of tensorboardX with PyTorch.
After instantiating a writer (writer = SummaryWriter()), and adding the value of the loss function (in every iteration) to it (writer.add_scalar('data/loss_func', loss.data[0].item(), iteration)), I have a folder which contains the saved run.
My questions are:
1) As far as I understand, after the training is complete, I need to write in the terminal (which corresponds to the command line prompt in Windows):
tensorboard --logdir=C:\Users\DrJohn\Documents\runs
where this is the folder which contains the file created by tensorboardX. What is the valid syntax in the Windows command prompt? I couldn't work this out from the online tutorials.
2) Is it possible to see the learning during the training by using tensorboardX (i.e. to plot the learning curve during the iterations)? Or is the only option to see everything once the training ends?
Thanks in advance
I am using sklearn to train a model. The training dataset has about 3000k samples, so I use SGDClassifier. The features are not very good, so I know it may not converge. But I want SGDClassifier to stop early according to my setting, something like max_iter = 1000. As far as I know, SGDClassifier has no parameter like max_iter. How can I do it?
This is the code.
This is the print information.
Any help will be appreciated...
This is weird, by default in scikit-learn 0.18.2, n_iter is set to 5 epochs. Can you please update your question with a script that makes it possible to reproduce the behavior using a toy dataset (for instance generated with numpy.random.randn or similar).
Note that in scikit-learn master and 0.19 once released, n_iter will be deprecated and replaced by max_iter and a tol (for instance set to 1e-3) to automatically stop when the objective function is no longer making progress.
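As a rough sketch of what this looks like in scikit-learn 0.19 or later: max_iter caps the number of epochs and tol allows stopping earlier. The toy dataset below is an assumption made only so the example is reproducible; it is not the asker's data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data (an assumption for illustration): two noisy Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 5) - 1.0, rng.randn(200, 5) + 1.0])
y = np.array([0] * 200 + [1] * 200)

# max_iter caps the number of passes over the data; tol lets training stop
# earlier once the objective is no longer making progress.
clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)

# n_iter_ reports how many epochs were actually run.
print("epochs actually run:", clf.n_iter_)
print("training accuracy:", clf.score(X, y))
```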
The 20 hours of running time may not be so strange, since you have a dataset of 3000k samples and SGDClassifier can be slow on that much data. What processor do you have?
Try stopping it with CTRL+C if you are on Windows. Then use n_iter to control the number of iterations that you want. Note, however, that the default is 5.
Finally, if you want to save a model see here:
Save and Load Machine Learning Models in Python with scikit-learn