Model evaluation : model.score Vs. ROC curve (AUC indicator) - python-3.x

I want to evaluate a logistic regression model (binary outcome) using two measures:
1. model.score and the confusion matrix, which give me 81% classification accuracy
2. the ROC curve (AUC), which gives back a 50% value
Are these two results in contradiction? Is that possible? I must be missing something, but I still can't find it.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score

y_pred = log_model.predict(X_test)        # hard 0/1 predictions at the default 0.5 threshold
print(accuracy_score(y_test, y_pred))     # ~0.81
cm = confusion_matrix(y_test, y_pred)
print(y_test.count())
print(cm)
# note: roc_curve returns (fpr, tpr, thresholds), in that order
fpr, tpr, _ = roc_curve(y_test, y_pred, drop_intermediate=False)
roc = roc_auc_score(y_test, y_pred)       # ~0.50

The accuracy score is calculated on the assumption that a class is selected if its predicted probability is greater than 50%. This means you are looking at only one case (one working point) out of many. Let's say you'd like to classify an instance as '0' whenever its predicted probability of being '0' exceeds 30% rather than the default 50% (this can make sense if that class is more important to you and its a priori probability is very low). In that case you will get a very different confusion matrix and a different accuracy ([TP + TN] / [ALL]).

The ROC AUC score examines all of these working points and gives you an estimate of the overall model. A score of 50% means the model is no better than randomly assigning classes according to their a priori probabilities. You would like the AUC to be much higher than that before calling it a good model.
So in the case above, you can say that your model does not have much predictive strength. In fact, a better "predictor" would be to simply predict everything as "1", which in your case would lead to an accuracy above 99%.
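To see the difference between a single working point and the whole curve, it helps to score the predicted probabilities rather than the hard 0/1 labels (computing roc_auc_score on hard labels, as in the question, reduces the curve to that single working point). A minimal sketch, assuming log_model, X_test and y_test from the question:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_proba = log_model.predict_proba(X_test)[:, 1]   # class-1 probabilities, not hard labels

# accuracy at a few different working points (thresholds)
for threshold in (0.3, 0.5, 0.7):
    y_at_t = (y_proba >= threshold).astype(int)
    print(threshold, accuracy_score(y_test, y_at_t))

# AUC computed from the probabilities summarises all working points at once
print(roc_auc_score(y_test, y_proba))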

Related

How does logistic regression build Sigmoid curve from categorical dependent variable?

I'm exploring the scikit-learn logistic regression algorithm. I understand that as part of the training, the algorithm builds a regression curve whose y-variable ranges from 0 to 1 (a sigmoid S-curve). The y-variable is a continuous variable here (although in reality it is a discrete variable).
How is the algorithm able to learn the S-curve when the training dataset reflects reality and includes the y-variable as a discrete variable? There is no probability estimate in the training, so I'm wondering how the algorithm is able to learn the S-curve.
There is no probability estimate in the training
Sure, but we pretend there is for modeling purposes. We want to maximize the probability of, as you call it, “reality”—if the observed response (the discrete value you refer to) is a 0, we want to predict that with probability 1; similarly, if the response is a 1, we want to predict that with probability 1.
Fitting the model to one data point, getting the right answer with probability 1, would be easy. Of course, we have more than one data point. We have to balance concerns between these. We want the predicted value sigmoid(weights * features) to be close to the true response (0 or 1) for all of the data points, but there may not be a way to set the parameters of the model to achieve this. (That is, the data may not be linearly separable.)
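In symbols (this is just the standard likelihood written out to make the idea above concrete): writing p_i = sigmoid(weights * features_i) for the predicted probability of the i-th point and y_i for its observed 0/1 response, the fitting procedure maximizes
L = product_i [ p_i^(y_i) * (1 - p_i)^(1 - y_i) ]
log L = sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]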
Good question! The fitting process in logistic regression is a search procedure that seeks the beta coefficients that minimize the error between the probabilities predicted by the model (continuous values) and the data (discrete values).
In logistic regression, you model probabilities using a logistic function (also known as a sigmoid function):
XB = B0 + B1 * X1 + B2 * X2 + ... + BN * XN
p(X) = e^(XB) / (1 + e^(XB))
The algorithm finds the beta coefficients using maximum likelihood estimation. The function to be minimized is called the cost function; for logistic regression it is the negative log-likelihood (log loss):
sum -( y_i * log(P(X_i)) + (1 - y_i) * log(1 - P(X_i)) )
(Simpler error measures such as sum (P(X_i) - y_i)^2 or sum |P(X_i) - y_i| are sometimes shown for intuition, but they are not what sklearn's logistic regression minimizes.)
An initial set of betas is picked at random, the cost is calculated, and the algorithm then picks a new set of betas that results in a lower cost. The algorithm stops searching for new betas when the decrease in cost is smaller than a given threshold (set by the tol parameter in sklearn).
The way the model converges to the final set of coefficients depends on the solver parameter. Each solver has a different way of converging to the final set of betas, but they usually converge to the same results.
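For intuition, here is a minimal sketch (the one-feature data below is made up purely for illustration) showing that the model is fitted on discrete 0/1 labels, yet predict_proba returns the continuous S-curve:

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy 1-D data: larger x values are more likely to be labelled 1
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-2 * X.ravel()))).astype(int)

model = LogisticRegression().fit(X, y)   # trained on discrete 0/1 labels only
probs = model.predict_proba(X)[:, 1]     # continuous probabilities: the learned S-curve
print(model.intercept_, model.coef_)     # the fitted betas
print(probs[:5], probs[-5:])             # near 0 at the left end of the range, near 1 at the right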

Evaluating Gaussian Mixture model using a score metric?

I have 1D data (one column of data). I used a Gaussian Mixture Model (GMM) as a density estimator, using this implementation in Python: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html. By relying on the AIC/BIC criteria I was able to determine the number of components. After I fit the GMM, I plotted the kernel density estimate of the original observations together with that of data sampled from the GMM. The two densities are quite similar (that is good). But I would like some metric to report how good the fitted model is.
from sklearn.mixture import GaussianMixture

g = GaussianMixture(n_components=35)
data = df['x'].values.reshape(-1, 1)  # data taken from the data frame (10,000 data points)
clf = g.fit(data)                     # fit the model
samples = clf.sample(10000)[0]        # generate the same number of sample points as in the original data
I found score in the implementation, but I am not sure how to use it. Am I doing it wrong? Or is there a better way to show how accurate the fitted model is, apart from histograms or kernel density plots?
print(clf.score(data))     # average log-likelihood per sample of the original data
print(clf.score(samples))  # average log-likelihood per sample of the generated data
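Since GaussianMixture.score returns the average per-sample log-likelihood, one simple way to report how good the fitted model is (beyond density plots) is to compare that score on held-out data for a few values of n_components. A rough sketch, assuming df['x'] as in the question:

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

data = df['x'].values.reshape(-1, 1)
train, test = train_test_split(data, test_size=0.2, random_state=0)

for k in (5, 15, 25, 35):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(train)
    # higher held-out log-likelihood = better density fit on unseen data
    print(k, gmm.score(test))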
You can use normalized_mutual_info_score, adjusted_rand_score or silhouette_score to evaluate your clusters. All of these metrics are implemented in the sklearn.metrics module. Note that the first two compare your assignments against ground-truth labels, while the silhouette score only needs the data and the predicted cluster assignments.
EDIT: You can check this link for more detailed explanations.
In summary:
Adjusted Rand Index: measures the similarity of the two assignments.
Normalized Mutual Information: measures the agreement of the two assignments.
Silhouette Coefficient: measures how well-assigned each individual point is.
from sklearn.metrics import silhouette_score

gmm.fit(x_vec)
pred = gmm.predict(x_vec)
print("gmm silhouette:", silhouette_score(x_vec, pred))
I would rather use cross-validation and look at the accuracy of the trained model.
Use the predict method of the fitted model to predict the labels of unseen data (use cross-validation and report the accuracy): https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.predict
Toy example:
g = GaussianMixture(n_components=35)
g.fit(train_data)              # fit the model
y_pred = g.predict(test_data)  # component labels for unseen data
EDIT:
There are several options for measuring the performance in your unsupervised case. For a GMM, which is based on actual probabilities, the most common are BIC and AIC. They are included directly in the scikit-learn GaussianMixture class.
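A minimal sketch of using them, assuming data as defined earlier in the question (for both criteria, lower is better):

from sklearn.mixture import GaussianMixture

for k in (5, 15, 25, 35):
    g = GaussianMixture(n_components=k, random_state=0).fit(data)
    # both criteria trade off goodness of fit against model complexity
    print(k, g.bic(data), g.aic(data))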

Why are my r^2 values so consistently negative?

I'm not sure whether the problem is with my regression estimator models or with my understanding of what the r^2 measure of fit actually means. I am working on a project using scikit-learn and ~11 different regression estimators to produce (rough!) predictions of baseball fantasy performance. Certain models always fare better than others (Decision Tree Regression and Extra Tree Regression produce the worst r^2 scores, while ElasticNetCV and LassoCV produce the best r^2 scores and every once in a while might even produce a slightly positive number!).
If a horizontal line produces an r^2 score of 0, then even if all my models were worthless, literally had zero predictive value, and were spitting out numbers entirely at random, shouldn't I still get small positive numbers for r^2 sometimes, from sheer dumb luck alone? Eight of the 11 estimators I use, despite running over different datasets hundreds of times, have never once produced even a tiny positive number for r^2.
Am I misunderstanding how r^2 works?
I am not switching the order in sklearn's .score function either. I have double-checked this many times. When I do put y_pred and y_true in the wrong order, it yields r^2 values that are hugely negative (like < -50).
The fact that that's the case actually adds to my confusion about how r^2 here is a measure of fit, but I digress...
## I don't know whether I'm supposed to include my df4 or even a
## sample, but suffice to say here is just a single row to show what
## kind of data we have. It is all normalized and/or z-scored.
"""
>> print(df4.head(1))
# single row; index ("Points") = 3.0; feature columns and values:
# HomeAway=1.0, ParkFactor=-1.229, Salary=-0.122111, HandedVs=1.0,
# Hand=0.0, oppoBullpen=-0.90331, RibRunHistory=0.964943,
# BibTibHistory=0.806874, GrabBagHistory=-0.224993, oppoTotesRank=-0.846859,
# oppoSwipesRank=-1.40371, oppoWalksRank=-1.159115, Temp=-0.665324,
# Precip=-0.380048, WindSpeed=-0.365671, WindDirection=0.229944,
# oppoPositFantasy=-1.011505, oppoFantasy=0.919269
"""
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, cross_val_score

def ElasticNetValidation(df4):
    X = df4.values   # features
    y = df4.index    # the target ("Points") is stored in the index
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
    ENTrain = ElasticNetCV(cv=20)
    ENTrain.fit(X_train, y_train)
    y_pred = ENTrain.predict(X_test)
    EN = ElasticNetCV(cv=20)
    ENModel = EN.fit(X, y)   # refit on all the data for cross_val_score
    print('ElasticNet R^2: ' + str(r2_score(y_test, y_pred)))
    scores = cross_val_score(ENModel, X, y, cv=20)   # default regressor score is R^2, not accuracy
    print("ElasticNet R^2 (CV): %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    return ENModel
When I run this estimator, along with the ten other regression estimators I have been experimenting with, both r2_score() and cross_val_score().mean() come out negative nearly every time. Certain estimators ALWAYS produce negative scores that are not even close to zero (decision tree regressor, extra tree regressor). Certain estimators fare better and sometimes even produce a tiny positive score, never more than 0.01 though, and even those estimators (ElasticNetCV, LassoCV, LinearRegression) are negative most of the time, albeit only slightly negative.
Even if these models I'm building are horrible, say they are totally random and have no predictive power whatsoever when it comes to the target: shouldn't they predict better than a plain horizontal line about as often as not? How is it that an unrelated model predicts WORSE than a horizontal line so consistently?
You most likely have an overfitting problem. As you correctly mentioned, negative R2 values can occur if your model performs worse than just fitting an intercept term. Your models probably do not capture any 'real' underlying dependence but merely fit random noise. You are calculating the R2 score on a small test set, and it is quite possible that this fitting of noise consistently yields a worse result than a simple intercept term on the test set.
This is a typical case of the bias-variance tradeoff. Your models have low bias and high variance and therefore perform poorly on the test data. There are certain models that aim at reducing overfitting / variance, for example the Lasso and the Elastic Net. These are in fact among the models that you see performing better.
In order to convince yourself that sklearn's r2_score function works properly and to get familiarised with it, I would recommend that you first fit your models and predict on the training data only (leave out the CV as well). R2 can never be negative in this case. Also make sure that your models include an intercept term (wherever available).
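A quick sanity check along those lines, as a sketch only (it reuses df4 from the question; the exact numbers will differ):

from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score

X = df4.values
y = df4.index   # the target ("Points") lives in the index

model = ElasticNetCV(cv=20).fit(X, y)
# R^2 measured on the same data the model was fitted on:
# for a model with an intercept this should never be negative
print("train R^2:", r2_score(y, model.predict(X)))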

Sklearn logistic regression optimized on a profit loss matrix with unbalanced classes

I need to train a logistic regression in scikit-learn with different profit-loss weights between classes.
The positive class is a loss, meaning that each time a positive event happens, it costs the company, say, $1,000. This applies to both True Positive and False Negative cases.
On the other hand, each negative case (both True Negative and False Positive) makes the company gain $50.
The question is: how do I train, say, a Logistic Regression classifier in sklearn so that it maximizes the profit?
A further complication is that the positive and negative classes are unbalanced, meaning that the Positives represent 5% of the overall sample size while the Negatives represent 95%.
Thanks for helping
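There is no answer in this thread, but one common way to fold such asymmetric costs into scikit-learn's LogisticRegression is the class_weight parameter, with weights proportional to the cost attached to each class. A sketch only (X_train / y_train are assumed, the 1000 and 50 simply mirror the figures in the question, and you would normally still tune the decision threshold on top of this):

from sklearn.linear_model import LogisticRegression

# weight each class roughly in proportion to what its cases are worth
clf = LogisticRegression(class_weight={1: 1000, 0: 50}, max_iter=1000)
clf.fit(X_train, y_train)

# alternative baseline: let sklearn counteract the 5% / 95% imbalance automatically
clf_balanced = LogisticRegression(class_weight="balanced", max_iter=1000)
clf_balanced.fit(X_train, y_train)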

Keras: how to figure out the Null hypothesis?

I am training a deep neural net using keras. One of the scores is called val_acc. I get like a 70% val_acc. How do I know if this is good or bad? The neural net is a binary classifier, so I am trying to predict a 1 or a 0. The data itself is about 65% 0's and 35% 1's. Is my 70% val_acc any good?
Accuracy is not always the right metric for evaluating a classifier. For example, it could be more important for you to classify the 1s correctly than the 0s (e.g. fraud detection), or the other way around. So you may be more interested in a classifier with higher precision or higher recall (sensitivity). In other words, false positives may be more expensive for you than false negatives, or vice versa.

If you have some idea of the costs of misclassification (e.g. for FPs and FNs), then you can compute the specific decision threshold that is optimal for 0-1 classification (instead of the default 0.5). You can also use ROC curves and AUC to assess the performance of your classifier (the higher the AUC, the better). Finally, you may want to consider the kappa statistic to see how useful / effective your classifier is.
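As a concrete starting point, it helps to compare val_acc against the trivial majority-class baseline (about 65% here, since 65% of the labels are 0) and to look at threshold-independent metrics. A rough sketch, assuming a fitted Keras binary classifier model (sigmoid output) and a validation set X_val, y_val:

import numpy as np
from sklearn.metrics import roc_auc_score, cohen_kappa_score

baseline_acc = np.mean(np.asarray(y_val) == 0)   # always predicting 0 gives ~65% here

probs = model.predict(X_val).ravel()             # predicted P(y = 1)
preds = (probs >= 0.5).astype(int)

print("baseline accuracy:", baseline_acc)
print("model accuracy:", np.mean(preds == np.asarray(y_val)))
print("AUC:", roc_auc_score(y_val, probs))
print("kappa:", cohen_kappa_score(y_val, preds))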
