logistic regression, model performance - statistics

I am interested in understanding in which scenarios one should use sensitivity and specificity, and when one should opt for precision and recall.
At a high level, I understand that for a balanced dataset we should use precision and recall, and that if the dataset is imbalanced we should use sensitivity and specificity, but I am not sure why people say this.
If you have a different perspective, please shed some light on how to think about these.
Thanks
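For concreteness, all four metrics come from the same binary confusion matrix, and recall is just another name for sensitivity. A minimal sketch with made-up counts (the numbers below are purely illustrative):

```python
# Hypothetical counts from a binary confusion matrix:
# rows = actual class, columns = predicted class.
tp, fn = 80, 20   # actual positives: predicted positive / negative
fp, tn = 10, 890  # actual negatives: predicted positive / negative

sensitivity = tp / (tp + fn)   # true positive rate (= recall)
specificity = tn / (tn + fp)   # true negative rate
precision   = tp / (tp + fp)   # positive predictive value
recall      = sensitivity      # same quantity, different name

print(f"sensitivity={sensitivity:.3f}, "
      f"specificity={specificity:.3f}, "
      f"precision={precision:.3f}")
```

Note that specificity is computed from the negatives only, while precision mixes true positives with false positives, which is why precision shifts with class balance while specificity does not.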

Related

Normality Assumption - how to check you have not violated it?

I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data are normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed, and that this may require a transformation (log, square root, etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain score (0 = no pain to 5 = intense pain) (discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, square root) be appropriate, and why? Is it best to do this before or after fitting the model? I assume I am trying to get close to a linear relationship between my DV and IVs.
As part of its output, SPSS provides plots of the standardised residuals against predicted values, as well as normal P-P plots of the standardised residuals. Are these all that is needed to check the normality assumption after fitting a model?
Many Thanks in advance!
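One way to see why the assumption concerns the residuals rather than the raw predictors: fit the regression and inspect the residuals directly. A minimal numpy-only sketch, using synthetic stand-ins for the variables described above (a skewed `age` predictor, a roughly normal `weight`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the variables in the question.
n = 200
age = rng.exponential(scale=40, size=n)      # positively skewed predictor
weight = rng.normal(70, 10, size=n)
y = 1.0 + 0.02 * age + 0.01 * weight + rng.normal(0, 0.5, size=n)

# Fit OLS via least squares: X @ beta ~ y.
X = np.column_stack([np.ones(n), age, weight])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

def skewness(x):
    """Sample skewness: mean of standardized values cubed."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# The residuals should look roughly normal (skew near 0),
# even though `age` itself is strongly skewed.
print("predictor skew:", round(skewness(age), 2))
print("residual skew: ", round(skewness(residuals), 2))
```

The point of the sketch: the predictor can be arbitrarily skewed while the residuals remain well behaved, which is why checking the normality of the IVs themselves is not the relevant diagnostic.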

What happens if optimal training loss is too high

I am training a Transformer. In many of my setups I obtain validation and training loss that look like this:
Then, I understand that I should stop training at around epoch 1. But then the training loss is very high. Is this a problem? Does the value of training loss actually mean anything?
Thanks
Regarding your first question: it is not necessarily a problem that your training loss is high, since there is no universal threshold for what counts as a high training loss. It depends on your dataset, your actual test metrics, and your business goals.
More specifically, the problems with the value of training loss:
The number isn't intuitive, since the loss objective is a surrogate optimized for gradient descent (i.e. a differentiable function, usually a log-based one such as cross-entropy).
You probably have intuitive business metrics (e.g., precision, recall) oriented towards your end goal, which you should use to decide if your model is good or not.
Your training loss is calculated on the training dataset, which is not always representative of a good model, as can be seen in the overfitted model you posted. You shouldn't use this number alone to judge the goodness of the model.
It depends on what you are trying to achieve. Is 80% accuracy high or low?
Regarding your second question: technically, the higher the number, the worse the model converged, so you should always try to lower it (while taking overfitting into consideration).
Comparatively, you can say that one model has a higher loss than another, and then try multiple hyperparameters (e.g., dropout, different optimizers) to push back the point where the validation loss diverges.
You are describing overfitting: Your model's expressive power is too strong and it is memorizing the training data, rather than learning useful representations that can generalize to the validation data.
To mitigate this issue, you should apply stronger regularization to your model to prevent it from memorizing and steer it towards useful representations.
Regularization methods include (but are not limited to):
Input augmentations
Dropout
Early stopping
Weight decay
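Of these, early stopping is easy to sketch without any framework. A minimal, hypothetical patience-based loop (the `val_losses` sequence stands in for per-epoch validation losses like those in the question):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch index with the best (lowest) validation loss,
    stopping once it has not improved for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break          # validation loss has diverged: stop
    return best_epoch

# Validation loss improves, then diverges as overfitting sets in.
losses = [2.0, 1.4, 1.1, 1.2, 1.3, 1.5, 1.8]
print(early_stop_epoch(losses))  # stops at the minimum: epoch 2
```

In a real training loop you would checkpoint the model weights at each new best epoch and restore that checkpoint when patience runs out.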

How to know which features contribute significantly in prediction models?

I am a novice in DS/ML. I am trying to solve the Titanic case study on Kaggle, but my approach has not been systematic so far. I have used correlation to find relationships between variables, and have used KNN and random-forest classification, but my models' performance has not improved. I selected features based on the correlations between variables.
Please guide me: are there certain scikit-learn methods that can be used to identify the features that contribute significantly to the predictions?
Various boosting techniques can improve accuracy considerably; I suggest you try gradient boosting.
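As a sketch of two scikit-learn routes (the dataset here is synthetic, standing in for the Titanic features): fit a random forest and rank features by `feature_importances_`, or use `SelectKBest` as a model-free filter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: 6 features, only the first 2 are informative
# (shuffle=False keeps the informative columns first).
X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=2, n_redundant=0,
                           shuffle=False, random_state=0)

# Model-based ranking: impurity-based importances from a forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
print("forest ranking (best first):", ranking)

# Model-free filter: keep the k features with highest mutual information.
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print("SelectKBest keeps columns:", np.flatnonzero(selector.get_support()))
```

Correlation only captures linear, pairwise relationships; both methods above can pick up nonlinear effects, which is often why correlation-based selection underperforms.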

feature_importances_ when using random forests in scikit-learn

I am using random forests in scikit-learn. I used feature_importances_ to see how important each feature is for the prediction goal, but I don't understand what this score is. Searching for feature_importances_ says it is the mean decrease in impurity, but I'm still confused about whether this is the same as the mean decrease in Gini impurity. If so, how is it calculated for trees and random forests? Beyond the math, I want to really understand what it means.
The feature_importances_ attribute tells you how much each feature contributes towards the prediction, in terms of impurity decrease (information gain).
Random forests split on features using an impurity criterion such as Gini impurity or entropy; the features that produce the largest impurity decrease across the forest's trees receive the highest scores.
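The "mean decrease in impurity" can be computed by hand from a fitted tree, which makes the definition concrete: every internal node credits its splitting feature with the impurity reduction it achieves, weighted by the fraction of samples reaching that node. A sketch for a single decision tree (the same computation is averaged over trees in a forest):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

t = tree.tree_
node_weight = t.weighted_n_node_samples
importances = np.zeros(X.shape[1])

# Mean decrease in impurity: for each internal node, credit the
# splitting feature with the (sample-weighted) impurity reduction.
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:            # leaf node: no split, no contribution
        continue
    decrease = (node_weight[node] * t.impurity[node]
                - node_weight[left] * t.impurity[left]
                - node_weight[right] * t.impurity[right])
    importances[t.feature[node]] += decrease

importances /= importances.sum()   # normalize so scores sum to 1

print(np.allclose(importances, tree.feature_importances_))  # True
```

So the score for a feature is the total impurity it removed, relative to the other features; a score of 0 means the feature was never chosen for a split.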

information criteria for confusion matrices

One can measure the goodness of fit of a statistical model using the Akaike Information Criterion (AIC), which accounts both for goodness of fit and for the number of parameters used to build the model. AIC involves the maximized value of the likelihood function for that model (L).
How can one compute L, given prediction results of a classification model, represented as a confusion matrix?
It is not possible to calculate the AIC from a confusion matrix alone, since it doesn't contain any information about the likelihood. Depending on the model you are using, it may be possible to calculate the likelihood or quasi-likelihood, and hence the AIC or QIC.
What is the classification problem that you are working on, and what is your model?
In a classification context often other measures are used to do GoF testing. I'd recommend reading through The Elements of Statistical Learning by Hastie, Tibshirani and Friedman to get a good overview of this kind of methodology.
Hope this helps.
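To illustrate the point above: when the classifier outputs probabilities (rather than only the hard counts in a confusion matrix), the log-likelihood, and hence the AIC, can be computed directly. A minimal sketch for a binary classifier with hypothetical predictions (the labels, probabilities, and parameter count below are made up):

```python
import math

def binary_log_likelihood(y_true, p_pred):
    """Log-likelihood of binary outcomes under predicted probabilities."""
    return sum(math.log(p if y == 1 else 1.0 - p)
               for y, p in zip(y_true, p_pred))

def aic(log_l, n_params):
    """AIC = 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_l

y_true = [1, 0, 1, 1, 0]
p_pred = [0.9, 0.2, 0.7, 0.6, 0.1]   # hypothetical model outputs

log_l = binary_log_likelihood(y_true, p_pred)
print("log-likelihood:", round(log_l, 3))
print("AIC (assuming k=3 parameters):", round(aic(log_l, 3), 3))
```

The confusion matrix discards exactly the probabilities this computation needs, which is why the likelihood cannot be recovered from it.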
Information-Based Evaluation Criterion for Classifier's Performance by Kononenko and Bratko is exactly what I was looking for:
Classification accuracy is usually used as a measure of classification performance. This measure is, however, known to have several defects. A fair evaluation criterion should exclude the influence of the class probabilities which may enable a completely uninformed classifier to trivially achieve high classification accuracy. In this paper a method for evaluating the information score of a classifier's answers is proposed. It excludes the influence of prior probabilities, deals with various types of imperfect or probabilistic answers and can be used also for comparing the performance in different domains.
