I am training a CNN model(made using Keras). Input image data has around 10200 images. There are 120 classes to be classified. Plotting the data frequency, I can see that sample data for every class is more or less uniform in terms of distribution.
Problem I am facing is loss plot for training data goes down with epochs but for validation data it first falls and then goes on increasing. Accuracy plot reflects this. Accuracy for training data finally settles down at .94 but for validation data its around 0.08.
Basically its case of over fitting.
I am using learning rate of 0.005 and dropout of .25.
What measures can I take to get better accuracy for validation? Is it possible that sample size for each class is too small and I may need data augmentation to have more data points?
Hard to say what could be the reason. First you can try classical regularization techniques like reducing the size of your model, adding dropout or l2/l1-regularizers to the layers. But this is more like randomly guessing the models hyperparameters and hoping for the best.
The scientific approach would be to look at the outputs for your model and try to understand why it produces these outputs and obviously checking your pipeline. Did you had a look at the outputs (are they all the same)? Did you preprocess the validation data the same way as the training data? Did you made a stratified train/test-split, i.e. keeping the class distribution the same in both sets? Is the data shuffles when you feed it to your model?
In the end you have about ~85 images per class which is really not a lot, compare CIFAR-10 resp. CIFAR-100 with 6000/600 images per class or ImageNet with 20k classes and 14M images (~500 images per class). So data augmentation could be beneficial as well.
Related
I'm trying to train a multilabel text classification model using BERT. Each piece of text can belong to 0 or more of a total of 485 classes. My model consists of a dropout layer and a linear layer added on top of the pooled output from the bert-base-uncased model from Hugging Face. The loss function I'm using is the BCEWithLogitsLoss in PyTorch.
I have millions of labeled observations to train on. But the training data are highly unbalanced, with some labels appearing in less than 10 observations and others appearing in more than 100K observations! I'd like to get a "good" recall.
My first attempt at training without adjusting for data imbalance produced a micro recall rate of 70% (good enough) but a macro recall rate of 45% (not good enough). These numbers indicate that the model isn't performing well on underrepresented classes.
How can I effectively adjust for the data imbalance during training to improve the macro recall rate? I see we can provide label weights to BCEWithLogitsLoss loss function. But given the very high imbalance in my data leading to weights in the range of 1 to 1M, can I actually get the model to converge? My initial experiments show that a weighted loss function is going up and down during training.
Alternatively, is there a better approach than using BERT + dropout + linear layer for this type of task?
In your case it might be helpful to balance the labels in the training data. You have a lot of data, so you could afford to loose a part of it by balancing. But before you do this, I recommend to read this answer about balancing classes in traing data.
If you really only care about recall, you could try to tune your model maximizing recall.
I develpoed a simple CNN model for MNIST dataset and i got 98% validation accuracy. But after saving the model through keras as model.h5 and evaluating the inference of th saved model in another jypyter session, the performance of the model is poor and the predictions are random
What needs to be done to get same accuracy after saving and uploading the model in different jypyter notebook session?
(Consider sharing your code/results so the community can help you better).
I'm assuming you're using Tensorflow/Keras, so model.save('my_model.h5') after your model.fit(...) should save the model, including the trained parameters (but not including the internal optimizer data; i.e gradients, etc..., which shouldn't affect the prediction capabilities of the model).
A number of things could cause a generalization gap like that, but...
Case 1: having a high training/validation accuracy and a low test (prediction) accuracy typically means your model overfit on the given training data.
I suggest adding some regularization to your training phase (dropout layers, cutout augmentation, L1/L2, etc...), a fewer number of epochs or early-stopping, or cross-validation/data reshuffle to cross off the possibility of overfitting.
Case 2: low intrinsic dataset variance, but unless you're using a subset of MNIST, this is unlikely. Make sure you are properly splitting your training/validation/test sets.
Again, it could be a number of issues, but these are the most common cases for low model generalization. Post your code (specifying the architecture, optimizer, hyperparameters, data prepropcessing, and test data used) so the answers can be more relevant to your problem.
I'm using Windows 10 machine.
Libraries: Keras with Tensorflow 2.0
Embeddings:Glove(100 dimensions)
I am trying to implement an LSTM architecture for multi-label text classification.
My problem is that no matter how much fine-tuning I do, the results are really bad.
I am not experienced in DL practical implementations that's why I ask for your advice.
Below I will state basic information about my dataset and my model so far.
I can't embed images since I am a new member so they appear as links.
Dataset form+Embedings form+train-test-split form
Dataset's labels distribution
My Implementation of LSTM
Model's Summary
Model's Accuracy plot
Model's Loss plot
As you can see my dataset is really small (~6.000 examples) and maybe that's one reason why I cannot achieve better results. Still, I chose it because it's unbiased.
I'd like to know if there is any fundamental mistake in my code regarding the dimensions, shape, activation functions, and loss functions for multi-label text classification?
What would you recommend to achieve better results on my model? Also any general advice regarding optimizing, methods,# of nodes, layers, dropouts, etc is very welcome.
Model's best val accuracy that I achieved so far is ~0.54 and even if I tried to raise it, it seems stuck there.
There are many ways to get this wrong but the most common mistake is to get your model overfit the training data.
I suspect that 0.54 accuracy means that your model selects the most common label (offensive) for almost all cases.
So, consider one of these simple solutions:
Create balanced training data: like 400 samples from each class.
or sample balanced batches for training (exactly the same number of labels on each training batch)
In addition to tracking accuracy and loss, look at precision-recall-f1 or even better try plotting area under curve, maybe different classes need different thresholds of activation. (If you are using Sigmoid on last layer maybe one class could perform better with 0.2 activations and another class with 0.7)
first try simple model. embedding 1 layer LSTM than classify
how to tokenize text , is vocab size enough ?
try dice loss
Data augmentation is surely a great regularization method, and it improves my accuracy on the unseen test set. However, I do not understand why it reduces the convergence speed of the network? I know each epoch takes a longer time to train since image transformations are applied on the fly. But why does it affect the convergence? For my current setup, the network hits a 100% training accuracy after 5 epochs without data augmentation (and clearly overfits) - with data augmentation, it takes 23 epochs to hit 95% training accuracy and never seems to hit 100%.
Any links to research papers or comments on the reasonings behind this?
I guess you are evaluating accuracy on the train set, right? And it is a mistake...
Without augmentation your network simply overfits. You have a predefined number of images, for instance, 1000, and your network during training can easily memorize dataset labels. And you are evaluating the model on the fixed (not augmented) dataset.
When you are training your network with data augmentation, basically, you are training a model on a dataset of infinite size. You are doing augmentation on the fly, which means that the model "sees" new images every time, and it cannot memorize them perfectly with 100% accuracy. And you are evaluating the model on the augmented (infinite) dataset.
When you train your model with and without augmentation, you evaluate it on the different datasets, so it is not correct to compare their accuracy.
Piece of advice:
Do not look at train set accuracy, it is simply misleading when you use augmentations. Instead - evaluate your model on the test set (or validation set), which is not augmented. By doing this - you'll see the real accuracy increase for your model.
P.S. If you want to find out more about image augmentaitons, I really recommend you to check this guide - https://notrocketscience.blog/complete-guide-to-data-augmentation-for-computer-vision/
I am trying to classify around 400K data with 13 attributes. I have used python sklearn's SVM package, but it didn't work, and then I learned that SVM's are not suitable for large dataset classification. Then I used the (sklearn) ANN using the following MLPClassifier:
MLPClassifier(solver='adam', alpha=1e-5, random_state=1,activation='relu', max_iter=500)
and trained the system using 200K samples, and tested the model on the remaining ones. The classification worked well. However, my concern is that the system is over trained or overfit. Can you please guide me on the number of hidden layers and node sizes to make sure that there is no overfit? (I have learned that the default implementation has 100 hidden neurons. Is it ok to use the default implementation as is?)
To know if your are overfitting you have to compute:
Training set accuracy
Test set accuracy
Once you have calculated this scores, compare it. If training set score is much better than your test set score, then you are overfitting. This means that your model is "memorizing" your data, instead of learning from it to make future predictions.
If you are overfitting with Neuronal Networks you probably have to reduce the number of layers and reduce the number of neurons per layer. There isn't any strict rule that says the number of layer or neurons you need depending on you dataset size. Every dataset can behaves completely different with the same dataset size.
So, to conclude, if you are overfitting, you would have to evaluate your model accuracy using different parameters of layers and number of neurons, and, then, observe with which values you obtain the best results. There are some methods you can use to find the best parameters, is like gridsearchCV.