Keras: how to figure out the Null hypothesis? - statistics

I am training a deep neural net using keras. One of the scores is called val_acc. I get like a 70% val_acc. How do I know if this is good or bad? The neural net is a binary classifier, so I am trying to predict a 1 or a 0. The data itself is about 65% 0's and 35% 1's. Is my 70% val_acc any good?

Accuracy is not always the right metric for the evaluation of a classifier. For example, it could be more important for you to classify the 1s more correctly than 0s (for example fraud detection) or the other way. So you may be interested to have a classifier with higher precision (specificity) or recall (sensitivity). In other words, false positives may be more expensive for you than false negatives. If you have some idea about the costs of misclassifications (e.g. for FPs & FNs) then you can precisely compute the specific threshold that will be optimal (instead of default 0.5) for 0-1 classification. You can use ROC curves and AUC to find performance of your classifier as well (the higher AUC the better). Finally you may want to consider kappa statistics to find how useful / effective your classifier is.

Related

Multilabel text classification with BERT and highly imbalanced training data

I'm trying to train a multilabel text classification model using BERT. Each piece of text can belong to 0 or more of a total of 485 classes. My model consists of a dropout layer and a linear layer added on top of the pooled output from the bert-base-uncased model from Hugging Face. The loss function I'm using is the BCEWithLogitsLoss in PyTorch.
I have millions of labeled observations to train on. But the training data are highly unbalanced, with some labels appearing in less than 10 observations and others appearing in more than 100K observations! I'd like to get a "good" recall.
My first attempt at training without adjusting for data imbalance produced a micro recall rate of 70% (good enough) but a macro recall rate of 45% (not good enough). These numbers indicate that the model isn't performing well on underrepresented classes.
How can I effectively adjust for the data imbalance during training to improve the macro recall rate? I see we can provide label weights to BCEWithLogitsLoss loss function. But given the very high imbalance in my data leading to weights in the range of 1 to 1M, can I actually get the model to converge? My initial experiments show that a weighted loss function is going up and down during training.
Alternatively, is there a better approach than using BERT + dropout + linear layer for this type of task?
In your case it might be helpful to balance the labels in the training data. You have a lot of data, so you could afford to loose a part of it by balancing. But before you do this, I recommend to read this answer about balancing classes in traing data.
If you really only care about recall, you could try to tune your model maximizing recall.

How to adjust linear regression prediction based on uncertainty?

In risk management is frequent that we want to be more conservative in the face of high uncertainty. Is there a way of "adjusting" a linear regression prediction based on the uncertainty (i.e., standard deviation of prediction)? For example, if I have that prediction 1 = prediction 2 = 100, but prediction 2 has a higher uncertainty, then I'd like to adjust prediction 2 to be smaller than prediction 1 because I'm acknowledging risk and being more conservative.
I assume this is a common problem, but I haven't been able to find anything online for some reason.
Thanks!
I would like a recommendation

What happens if optimal training loss is too high

I am training a Transformer. In many of my setups I obtain validation and training loss that look like this:
Then, I understand that I should stop training at around epoch 1. But then the training loss is very high. Is this a problem? Does the value of training loss actually mean anything?
Thanks
Regarding your first question - it is not necessarily a problem that your training loss is high, since there is no threshold for what is considered as a high training loss. It depends on your dataset, your actual test metrics and your business goals.
More specifically, the problems with the value of training loss:
The number isn't intuitive, since the loss objective is a metric optimized for gradient descent (i.e. a differentiable function, usually the log version of it).
You probably have intuitive business metrics (e.g., precision, recall) oriented towards your end goal, which you should use to decide if your model is good or not.
Your train loss is calculated on the training dataset, which is not always representative of a good model, as can be seen in the overfitted model you posted. You shouldn't use this number to make decisions for the goodness of the model.
It depends on what you are trying to achieve. Is 80% accuracy high or low?
Regarding your second question - Technically, the higher the number the worse the model did in converging, so you should always try to lower it (while taking into consideration overfitting).
Comparatively, you can say that one model has a higher loss than another and then try multiple hyperparameters (e.g., dropout, different optimizers) to minimize the point where the validation set diverges.
You are describing overfitting: Your model's expressive power is too strong and it is memorizing the training data, rather than learning useful representations that can generalize to the validation data.
To mitigate this issue, you should apply stronger regularization to your model to prevent it from memorizing and steer it towards useful representations.
regularization methods include (but are not limited to):
Input augmentations
DropOut
Early stopping
Weight decay

SkikitLearn learning curve strongly dependent on batch size of MLPClassifier ??? Or: how to diagnose bias/ variance for NN?

I am currently working on a classification problem with two classes in ScikitLearn with the solver adam and activation relu. To explore if my classifier suffers from high bias or high variance, I plotted the learning curve with Scikitlearns build-in function:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
I am using a Group-K_Fold crossvalidation with 8 splits.
However, I found that my learning curve is strongly dependent on the batch size of my classifier:
https://imgur.com/a/FOaWKN1
Is it supposed to be like this? I thought learning curves are tackling the accuracy scores dependent on the portion of training data independent from any batches/ epochs? Can I actually use this build-in function for batch methods? If yes, which batch size should I choose (full batch or batch size= number of training examples or something in between) and what diagnosis do I get from this? Or how do you usually diagnose bias/ variance problems of a neural network classifier?
Help would be really appreciated!
Yes, the learning curve depends on the batch size.
The optimal batch size depends on the type of data and the total volume of the data.
In ideal case batch size of 1 will be best, but in practice, with big volumes of data, this approach is not feasible.
I think you have to do that through experimentation because you can’t easily calculate the optimal value.
Moreover, when you change the batch size you might want to change the learning rate as well so you want to keep the control over the process.
But indeed having a tool to find the optimal (memory and time-wise) batch size is quite interesting.
What is Stochastic Gradient Descent?
Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset.
The update of the model for each training example means that stochastic gradient descent is often called an online machine learning algorithm.
What is Batch Gradient Descent?
Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.
One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.
What is Mini-Batch Gradient Descent?
Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.
Implementations may choose to sum the gradient over the mini-batch or take the average of the gradient which further reduces the variance of the gradient.
Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.
Source: https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/

Test error lower than training error

Would appreciate your input on this. I am constructing a regression model with the help of genetic programming.
If my RMSE on test data is (much) lower than my RMSE on training data for a 1:5 ratio of data, should I be worried?
The test data is drawn randomly without replacement from a set of 24 data points. The model was built using genetic programming technique so the number of features, modeling framework etc vary as I minimize the training RMSE regularized by the number of nodes in the GP tree.
Is the model underfitted? Or should I have minimized MSE instead of RMSE (I thought it would be the same as MSE is positive and the minimum of MSE would coincide with the minimum of RMSE assuming the optimizer is good enough to find the minimum)?
Tks
So your model is trained on 20 out of 24 data points and tested on the 4 remaining data points?
To me it sounds like you need (much) more data, so you can have a larger train and test sets. I'm not surprised by the low performance on your test set as it seems that your model wasn't able to learn from such few data. As a rule of thumb, for machine learning you can never have enough data. Is it a possibility to gather a larger dataset?

Resources