Why are my r^2 values so consistently negative? - scikit-learn

I'm not sure if the problem is with my regression estimator models, or with my understanding of what the r^2 measure of fit actually means. I am working on a project using scikit-learn and ~11 different regression estimators in order to produce (rough!) predictions of baseball fantasy performance. Certain models always fare better than others (Decision Tree Regression and Extra Tree Regression produce the worst r^2 scores, while ElasticNetCV and LassoCV produce the best r^2 scores, and every once in a while one of them might even be a slightly positive number!).
If a horizontal line produces an r^2 score of 0, then even if all my models were worthless, had literally zero predictive value, and were spitting out numbers entirely at random, shouldn't I then get small positive numbers for r^2 sometimes, from sheer dumb luck alone? 8 of the 11 estimators I use, despite running over different datasets hundreds of times, have never once produced even a tiny positive number for r^2.
Am I misunderstanding how r^2 works?
I am not switching the order in sklearn's .score function either. I have double checked this many times. When I do put y_pred and y_true in the wrong order, it yields r^2 values that are hugely negative (like < -50).
The fact that that's the case actually adds to my confusion as to how r^2 is a measure of fit here, but I digress...
## I don't know whether I'm supposed to include my df4 or even a
##sample, but suffice to say here is just a single row to show what
##kind of data we have. It is all normalized and/or zscore'd
>> print(df4.head(1))
HomeAway ParkFactor Salary HandedVs Hand oppoBullpen \
3.0 1.0 -1.229 -0.122111 1.0 0.0 -0.90331
RibRunHistory BibTibHistory GrabBagHistory oppoTotesRank \
3.0 0.964943 0.806874 -0.224993 -0.846859
oppoSwipesRank oppoWalksRank Temp Precip WindSpeed \
3.0 -1.40371 -1.159115 -0.665324 -0.380048 -0.365671
WindDirection oppoPositFantasy oppoFantasy
3.0 0.229944 -1.011505 0.919269
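from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

def ElasticNetValidation(df4):
    X = df4.values
    y = df4.index  # target values are taken from the DataFrame index
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
    # Fit on the training split only and score on the held-out 10%
    ENTrain = ElasticNetCV(cv=20)
    ENTrain.fit(X_train, y_train)
    y_pred = ENTrain.predict(X_test)
    EN = ElasticNetCV(cv=20)
    ENModel = EN.fit(X, y)
    print('ElasticNet R^2: ' + str(r2_score(y_test, y_pred)))
    # cross_val_score uses the estimator's default scorer, which is R^2 for regressors
    scores = cross_val_score(ENModel, X, y, cv=20)
    print("ElasticNet Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    return ENModel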
When I run this estimator, along with ten other regression estimators I have been experimenting with, I get both r2_score() and cross_val_score().mean() showing negative numbers nearly every time. Certain estimators ALWAYS produce negative scores that are not even close to zero (decision tree regressor, extra tree regressor). Certain estimators fare better and even sometimes produce a tiny positive score, never more than 0.01 though, and even those estimators (ElasticNetCV, LassoCV, LinearRegression) are negative most of the time, albeit only slightly negative.
Even if these models I'm building are horrible, say they are totally random and have no predictive power whatsoever when it comes to the target: shouldn't they predict better than a plain horizontal line as often as not? How is it that an unrelated model is predicting WORSE than a horizontal line so consistently?

You most likely have issues with overfitting. As you correctly mentioned, negative R2 values can occur if your model performs worse than just fitting an intercept term. Your models probably do not capture any 'real' underlying dependence but merely fit random noise. You are calculating the R2 score on a small test set, and it is quite possible that this fitting of noise yields consistently worse results than a simple intercept term on the test set.
This is a typical case of the bias-variance tradeoff. Your models have low bias and high variance and therefore perform poorly on the test data. There are certain models that aim at reducing overfitting / variance, for example the Lasso and Elastic Net. These are in fact among the models that you see performing better.
In order to convince yourself that sklearn's r2_score function works properly, and to get familiar with it, I would recommend that you first fit and predict your model on the training data only (leave out the CV as well). For a least-squares model with an intercept, R2 can never be negative in this case, because the fit is at least as good as the horizontal line through the training mean. Also make sure that your models include an intercept term (wherever available).
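A minimal sketch of the check suggested above, on synthetic data in which the features are pure noise and carry no information about the target. The shapes, seed and number of repeats are arbitrary assumptions, not the asker's data. It scores an ordinary LinearRegression on its own training data and on a 10% held-out split, over many random splits:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

train_scores, test_scores = [], []
for seed in range(100):
    # Features and target are independent noise: there is nothing real to learn.
    X = rng.normal(size=(200, 18))
    y = rng.normal(size=200)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)  # fits an intercept by default
    train_scores.append(r2_score(y_train, model.predict(X_train)))
    test_scores.append(r2_score(y_test, model.predict(X_test)))

print("min train R^2:", min(train_scores))
print("mean test R^2:", np.mean(test_scores))
print("share of negative test R^2:", np.mean(np.array(test_scores) < 0))
```

On the training data the score should not drop below zero, while on the held-out data it is typically negative. One reason is that r2_score compares the model against the mean of the test targets themselves, a baseline the model never gets to see, so a model that has only fit noise tends to lose against it rather than win half the time.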


How does logistic regression build Sigmoid curve from categorical dependent variable?

I'm exploring the Scikit-learn logistic regression algorithm. I understand that as part of the training, the algorithm builds a regression curve where the y-variable ranges from 0 to 1 (sigmoid S-curve). The y-variable is a continuous variable here (although in reality it is a discrete variable).
How is the algorithm able to learn the S-curve, when the training dataset reflects reality and includes the y-variable as a discrete variable? There is no probability estimate in the training, so I'm wondering how is the algorithm able to learn the S-curve.
"There is no probability estimate in the training"
Sure, but we pretend there is for modeling purposes. We want to maximize the probability of, as you call it, “reality”—if the observed response (the discrete value you refer to) is a 0, we want to predict that with probability 1; similarly, if the response is a 1, we want to predict that with probability 1.
Fitting the model to one data point, getting the right answer with probability 1, would be easy. Of course, we have more than one data point. We have to balance concerns between these. We want the predicted value sigmoid(weights * features) to be close to the true response (0 or 1) for all of the data points, but there may not be a way to set the parameters of the model to achieve this. (That is, the data may not be linearly separable.)
Good question! The fitting process in logistic regression is a search procedure that seeks the beta coefficients that minimize the error between the probabilities predicted by the model (continuous values) and the data (discrete values).
In logistic regression, you model probabilities using a logistic function (also known as a sigmoid function):
XB = B0 + B1 * X1 + B2 * X2 + ... + BN * XN
p(X) = e^(XB) / (1 + e^(XB))
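The algorithm finds the beta coefficients by Maximum Likelihood estimation: it looks for the betas under which the observed 0/1 labels are most probable. Equivalently, it minimizes a cost function, the negative log-likelihood (also called log loss or cross-entropy):
-sum [ y_i * log(p(X_i)) + (1 - y_i) * log(1 - p(X_i)) ]
Other costs, such as the squared error sum (p(X_i) - y_i)^2 or the absolute error sum |p(X_i) - y_i|, are possible in principle, but the log loss is the one that corresponds to Maximum Likelihood. scikit-learn's LogisticRegression minimizes it together with a regularization penalty (L2 by default).
A starting set of betas is picked, the cost is calculated, and the algorithm picks a new set of betas that results in a lower cost. The algorithm stops searching for new betas when its stopping criterion is met, for example when the decrease in cost is smaller than a given threshold (controlled by the tol parameter in sklearn).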
The way the model converges to the final set of coefficients depends on the solver parameter. Each solver has a different way of converging to the final set of betas, but they usually converge to the same results.
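The search described above can be sketched with plain gradient descent on the negative log-likelihood (log loss), the cost that Maximum Likelihood estimation corresponds to. This is only an illustration under stated assumptions: the data is synthetic, the function names are hypothetical, there is no regularization, and scikit-learn's actual solvers use more sophisticated updates. The point is that the labels are strictly 0 or 1, yet the fitted curve outputs probabilities in between:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(beta, Xb, y):
    # Mean log loss; clipping avoids log(0) for extreme predictions.
    p = np.clip(sigmoid(Xb @ beta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic(X, y, lr=0.1, tol=1e-8, max_iter=10_000):
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend a column for B0
    beta = np.zeros(Xb.shape[1])                # starting betas
    prev = neg_log_likelihood(beta, Xb, y)
    for _ in range(max_iter):
        grad = Xb.T @ (sigmoid(Xb @ beta) - y) / len(y)
        beta -= lr * grad                       # step towards a lower cost
        cur = neg_log_likelihood(beta, Xb, y)
        if prev - cur < tol:                    # stop once the improvement is tiny
            break
        prev = cur
    return beta

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
# Discrete 0/1 labels drawn from a "true" S-curve that the model never sees directly.
y = (rng.random(500) < sigmoid(2.0 * x[:, 0] - 0.5)).astype(float)

beta = fit_logistic(x, y)
Xb = np.column_stack([np.ones(len(x)), x])
final_loss = neg_log_likelihood(beta, Xb, y)
probabilities = sigmoid(Xb @ beta)  # continuous values strictly between 0 and 1
print(beta, final_loss)
```

The 0/1 labels enter only through the cost: each step nudges the betas so that points labelled 1 get higher predicted probabilities and points labelled 0 get lower ones.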

What do sklearn.cross_validation scores mean?

I am working on a time-series prediction problem using GradientBoostingRegressor, and I think I'm seeing significant overfitting, as evidenced by a significantly better RMSE for training than for prediction. In order to examine this, I'm trying to use sklearn.model_selection.cross_validate, but I'm having problems understanding the result.
First: I was calculating RMSE by fitting to all my training data, then "predicting" the training data outputs using the fitted model and comparing those with the training outputs (the same ones I used for fitting). The RMSE that I observe is the same order of magnitude as the predicted values and, more importantly, it's in the same ballpark as the RMSE I get when I submit my predicted results to Kaggle (although the latter is lower, reflecting overfitting).
Second, I use the same training data, but apply sklearn.model_selection.cross_validate as follows:
cross_validate( predictor, features, targets, cv = 5, scoring = "neg_mean_squared_error" )
I figure the neg_mean_squared_error should be the square of my RMSE. Accounting for that, I still find that the error reported by cross_validate is one or two orders of magnitude smaller than the RMSE I was calculating as described above.
In addition, when I modify my GradientBoostingRegressor max_depth from 3 to 2, which I would expect reduces overfitting and thus should improve the CV error, I find that the opposite is the case.
I'm keenly interested in using Cross Validation so I don't have to validate my hyperparameter choices by using up Kaggle submissions, but given what I've observed, I'm not sure that the results will be understandable or useful.
Can someone explain how I should be using Cross Validation to get meaningful results?
I think there is a conceptual problem here.
If you want to compute the error of a prediction, you should not use the training data. As the name says, this type of data is used only for training; to evaluate accuracy scores you have to use data that the model has never seen.
As for cross-validation, it is an approach for estimating how well a model generalizes by rotating which part of the data is held out for testing. The process is as follows: you divide your data into n groups and iterate, changing the testing group you pick each time. With n groups you do n iterations, and each time the training and testing sets are different.
Basically, what you should do is something like this:
Train the model using months from 0 to 30 (for example)
See the predictions made with months from 31 to 35 as input.
If the input has to be the same length, divide the features in half (should be 17 months).
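A minimal sketch of the "train on earlier months, test on later months" idea using scikit-learn's TimeSeriesSplit together with cross_validate. The data here is a synthetic stand-in (the array shapes, the seed and the n_estimators value are assumptions for illustration), while the variable names predictor, features and targets mirror the call in the question:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_validate

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real data: 120 time-ordered rows, 4 features.
features = rng.normal(size=(120, 4))
targets = features[:, 0] * 3.0 + rng.normal(scale=0.5, size=120)

predictor = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
cv = TimeSeriesSplit(n_splits=5)  # each fold trains on the past and tests on the future

results = cross_validate(
    predictor, features, targets, cv=cv,
    scoring="neg_mean_squared_error", return_train_score=True,
)

# The scorer returns the negated MSE, so flip the sign before taking the root.
test_rmse = np.sqrt(-results["test_score"])
train_rmse = np.sqrt(-results["train_score"])
print("train RMSE per fold:", train_rmse)
print("test RMSE per fold: ", test_rmse)
```

Comparing the per-fold test RMSE with the train RMSE gives a view of overfitting that does not depend on scoring the model on the rows it was fitted to.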
I hope I understood correctly; otherwise, leave a comment.

Is it acceptable to scale target values for regressors?

I am getting very high RMSE and MAE for MLPRegressor, ForestRegression and Linear regression with only the input variables scaled (30,000+); however, when I scale the target values as well I get an RMSE of 0.2. I would like to know if that is an acceptable thing to do.
Secondly, is it normal to have better R squared values for test than for train (i.e. 0.98 for test and 0.85 for train)?
Thank You
Answering your first question: I think you are being misled by the performance measures you have chosen to evaluate your model with. Both RMSE and MAE are sensitive to the range in which you measure your target variable. If you scale down your target variable, the values of RMSE and MAE will certainly go down. Let's take an example to illustrate that.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
I have written two functions for computing RMSE and MAE. Now let's plug in some values and see what happens:
y_true = np.array([2,5,9,7,10,-5,-2,2])
y_pred = np.array([3,4,7,9,8,-3,-2,1])
For the time being, let's assume that the true and predicted values are as shown above. Now we are ready to compute RMSE and MAE for this data.
rmse(y_true, y_pred)  # approximately 1.541
mae(y_true, y_pred)   # 1.375
Now let's scale down our target variable by a factor of 10 and compute the same measure again.
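y_scaled_true = np.array([2,5,9,7,10,-5,-2,2])/10
y_scaled_pred = np.array([3,4,7,9,8,-3,-2,1])/10
rmse(y_scaled_true, y_scaled_pred)  # approximately 0.154
mae(y_scaled_true, y_scaled_pred)   # approximately 0.1375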
We can now see that just by scaling our target variable, our RMSE and MAE scores have dropped, creating the illusion that our model has improved when it actually has not. When we scale the model's predictions back, we are in the same state as before.
So, coming to the point, MAPE (Mean Absolute Percentage Error) could be a better way to measure the performance of your model, since it is insensitive to multiplying the target by a constant (though not to shifting it, and it is undefined when a true value is 0). If you compute MAPE for both sets of values, you see that they are the same:
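def mape(y, y_pred):
    return np.mean(np.abs((y - y_pred)/y))
mape(y_true, y_pred)                # approximately 0.2885
mape(y_scaled_true, y_scaled_pred)  # approximately 0.2885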
So it is better to rely on MAPE over MAE or RMSE if you want your performance measure to be independent of the scale in which the values are measured.
Answering your second question: since you are dealing with complicated models like MLPRegressor and ForestRegression, which have hyper-parameters that need to be tuned to avoid overfitting, the best way to find the ideal levels of the hyper-parameters is to divide the data into train, validation and test sets and use techniques like K-Fold Cross Validation to find the optimal setting. It is quite difficult to say whether the above values are acceptable just by looking at this one case.
It is actually a common practice to scale target values in many cases.
For example, a highly skewed target may give better results if a log or log1p transform is applied to it. I don't know the characteristics of your data, but there could be a transformation that decreases your RMSE.
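One way to apply such a transform without scoring on the transformed scale is scikit-learn's TransformedTargetRegressor, which fits on func(y) and maps predictions back with inverse_func. The sketch below uses synthetic, right-skewed data (the coefficients, seed and model choice are assumptions for illustration), and both RMSE values are computed in the original units so they are comparable:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
log_y = 1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=500)
y = np.expm1(log_y)  # right-skewed target, always greater than -1

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same regressor twice: once on the raw target, once on log1p(target).
raw_model = LinearRegression().fit(X_train, y_train)
log_target_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
).fit(X_train, y_train)

# predict() already returns values on the original scale for both models.
rmse_raw = np.sqrt(mean_squared_error(y_test, raw_model.predict(X_test)))
rmse_log = np.sqrt(mean_squared_error(y_test, log_target_model.predict(X_test)))
print("RMSE, raw target:  ", rmse_raw)
print("RMSE, log1p target:", rmse_log)
```

Because the comparison is made after inverting the transform, any improvement reflects a better fit rather than a change of units.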
Secondly, Test set is meant to be a sample of unseen data, to give a final estimate of your model's performance. When you see the unseen data and tune to perform better on it, it becomes a cross validation set.
You should try to split your data into three parts: train, cross-validation and test sets. Train on the training set, tune the parameters according to the model's performance on the cross-validation set, and once you are done tuning, run it on the test set to get an estimate of how it works on unseen data. Report that as the accuracy of your model.
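The three-way split described above could look roughly like this. It is a sketch under assumptions: synthetic data, a Ridge model standing in for whatever regressor is being tuned, a 60/20/20 split, and a hypothetical list of candidate alpha values:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = X[:, :5] @ np.array([2.0, -1.0, 0.5, 1.5, -0.5]) + rng.normal(size=600)

# 60% train, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune on the validation set only.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_scores = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_scores[alpha] = r2_score(y_val, model.predict(X_val))

best_alpha = max(val_scores, key=val_scores.get)

# Touch the test set once, after tuning is finished.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_r2 = r2_score(y_test, final_model.predict(X_test))
print("best alpha:", best_alpha, "test R^2:", test_r2)
```

The test score is only an honest estimate as long as it plays no part in choosing best_alpha.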

Model evaluation : model.score Vs. ROC curve (AUC indicator)

I want to evaluate a logistic regression model (binary event) using two measures:
1. model.score and the confusion matrix, which give me 81% classification accuracy
2. ROC curve (using AUC), which gives back a value of 50%
Are these two results in contradiction? Is that possible?
I'm missing something but still can't find it.
y_pred = log_model.predict(X_test)
accuracy_score(y_test , y_pred)
cm = confusion_matrix( y_test,y_pred )
print (cm)
fpr, tpr, _ = roc_curve(y_test, y_pred, drop_intermediate=False)
roc = roc_auc_score( y_test ,y_pred)
The accuracy score is calculated on the assumption that a class is selected if it has a predicted probability of more than 50%. This means that you are looking at only one case (one working point) out of many. Let's say you'd like to classify an instance as '0' even if it has a probability greater than 30% (this may happen if one of your classes is more important to you and its a priori probability is very low). In this case you will have a very different confusion matrix with a different accuracy ([TP+TN]/[ALL]). The ROC AUC score examines all of these working points and gives you an estimate of your overall model. A score of 50% means that the model is no better than a random selection of classes based on their a priori probabilities. You would like the ROC AUC to be much higher to say that you have a good model.
So in the above case - you can say that your model does not have a good prediction strength. As a matter of fact - a better prediction will be to predict everything as "1" - in your case it will lead to an accuracy of above 99%.
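To let roc_auc_score actually examine all working points, it needs continuous scores such as predict_proba output rather than the hard 0/1 labels from predict, which collapse the curve to a single threshold. A minimal sketch on a synthetic imbalanced dataset (the dataset parameters and seed are assumptions; log_model mirrors the name in the question):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Roughly 90% of samples in class 0 and 10% in class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

log_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = log_model.predict(X_test)               # hard labels: one working point
y_score = log_model.predict_proba(X_test)[:, 1]  # probability of class 1

accuracy = accuracy_score(y_test, y_pred)
auc_from_labels = roc_auc_score(y_test, y_pred)   # ROC built from a single threshold
auc_from_scores = roc_auc_score(y_test, y_score)  # ROC swept over all thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_score)  # return order is fpr, tpr, thresholds

print("accuracy:", accuracy)
print("AUC from hard labels:", auc_from_labels)
print("AUC from probabilities:", auc_from_scores)
```

On imbalanced data a high accuracy alongside a label-based AUC near 0.5 is not contradictory: a model that almost always predicts the majority class is accurate, yet it ranks nothing.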

GridSearchCV: based on mean_test_score results, predict should perform much worse, but it does not

I am trying to evaluate the performance of a regressor by means of GridSearchCV. In my implementation cv is an int, so I'm applying the K-fold validation method. Looking at cv_results_['mean_test_score'], the best mean score on the k-fold unseen data is around 0.7, while the train scores are much higher, like 0.999. This is very normal, and I'm ok with that.
Well, following the reasoning behind this concept, when I apply the best_estimator_ on the whole data set, I expect to see at least some part of the data predicted not perfectly, right? Instead, the numerical deviations between the predicted quantities and the real values are near zero for all datapoints. And this smells of overfitting.
I don't understand that, because if I remove a small part of the data and apply GridSearchCV to the remaining part, I find almost identical results as above, but the best regressor applied to the totally unseen data predicts with much higher errors, like 10%, 30% or 50%. Which is what I expected, at least for some points, fitting GridSearchCV on the whole set, based on the results of k-fold test sets.
Now, I understand that this forces the predictor to see all datapoints, but the best estimator is the result of k fits, each of which never saw a 1/k fraction of the data. Since mean_test_score is the average of these k scores, I expect to see a bunch of predictions (depending on the cv value) with errors distributed around a mean error that justifies a 0.7 score.
The refit=True parameter of GridSearchCV causes the estimator with the best hyperparameters found to be refit on the full data. So if your training error is almost zero in the CV folds, you would expect it to be near zero for best_estimator_ as well. Predicting on the data it was refit on measures training error, not the held-out error that mean_test_score reports.
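The difference between the three numbers in play can be sketched as follows. The data is synthetic and the estimator and parameter grid are assumptions chosen only because a decision tree overfits readily, not the asker's setup:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=30.0, random_state=0)
X_fit, X_unseen, y_fit, y_unseen = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [5, 10, None]},
    cv=5,
    refit=True,  # the default: refit the best configuration on all of X_fit
)
grid.fit(X_fit, y_fit)

cv_score = grid.best_score_                    # mean R^2 over the held-out folds
seen_score = grid.score(X_fit, y_fit)          # R^2 on the data used for the refit
unseen_score = grid.score(X_unseen, y_unseen)  # R^2 on data never used in fitting

print("mean CV score:      ", cv_score)
print("score on seen data: ", seen_score)
print("score on unseen data:", unseen_score)
```

The score on seen data is a training score for the refit model, so it is expected to sit well above the cross-validated score; only the held-out split says something about errors on new points.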
