How do I test my classifier accuracy against random values? - scikit-learn

I've set up my first scikit-learn example to play with and I'm trying to gauge accuracy on my predictions. I've got training and test lists set up fine, but I'm getting ~0.95 accuracy even if I give it random values.
This looks to be because I'm checking for 0/1 labels, and 95% of the labels are zeros, so it's guessing 0 every time and getting 0.95 accuracy (I think?). Obviously this isn't what I want.
How do I go about deciding if my classifiers are working, and how do I get meaningful accuracy values?

You have a clear class imbalance issue. Your classifier is predicting 0 all the time because it knows it will be right 95% of the time. You can inspect this by calling predict(X_test) on your fitted classifier: if all the values are 0, you know this is the case.
To get a better idea of how the model performs, you can upsample the data labelled 1 or downsample the data labelled 0. You can use this package, which builds on scikit-learn and implements a number of resampling methods. Alternatively, you can use scikit-learn's resample utility, which will bootstrap new data points for you.
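Here is a minimal sketch of both steps (clf, X_train, y_train, and X_test are assumed to be your fitted classifier and the NumPy arrays from your setup):

import numpy as np
from sklearn.utils import resample

# Step 1: check whether the classifier really predicts 0 for everything.
preds = clf.predict(X_test)
print(np.bincount(preds))            # if only the first bin is populated, it always predicts 0

# Step 2: upsample the minority class (label 1) so both classes are equally represented.
X_majority, X_minority = X_train[y_train == 0], X_train[y_train == 1]
X_minority_up = resample(X_minority,
                         replace=True,                # bootstrap sampling
                         n_samples=len(X_majority),   # match the majority count
                         random_state=0)
X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.concatenate([np.zeros(len(X_majority)), np.ones(len(X_minority_up))])
clf.fit(X_balanced, y_balanced)      # retrain on the balanced data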

Related

How to handle extreme outlier in the prediction of Random Forest Regression Model

I made predictions using Random Forest Regression and got an R² score of 90%, which I guess is good for this dataset, but when I checked the MSE it came out above 1500. I then compared the actual and predicted values and found that my model predicted just one value very poorly, and that single point is what drives the MSE so high. How do I handle the one value that my RFR model predicts so badly?
My training set contains similar data, so my model should have learned from it, but somehow my prediction for that one value is still poor. What is the problem?

What do sklearn.cross_validation scores mean?

I am working on a time-series prediction problem using GradientBoostingRegressor, and I think I'm seeing significant overfitting, as evidenced by a significantly better RMSE for training than for prediction. In order to examine this, I'm trying to use sklearn.model_selection.cross_validate, but I'm having problems understanding the result.
First: I was calculating RMSE by fitting to all my training data, then "predicting" the training data outputs using the fitted model and comparing those with the training outputs (the same ones I used for fitting). The RMSE that I observe is of the same order of magnitude as the predicted values and, more importantly, it's in the same ballpark as the RMSE I get when I submit my predicted results to Kaggle (although the latter is lower, reflecting overfitting).
Second, I use the same training data, but apply sklearn.model_selection.cross_validate as follows:
cross_validate( predictor, features, targets, cv = 5, scoring = "neg_mean_squared_error" )
I figure the neg_mean_squared_error should be the square of my RMSE. Accounting for that, I still find that the error reported by cross_validate is one or two orders of magnitude smaller than the RMSE I was calculating as described above.
In addition, when I modify my GradientBoostingRegressor max_depth from 3 to 2, which I would expect reduces overfitting and thus should improve the CV error, I find that the opposite is the case.
I'm keenly interested to use Cross Validation so I don't have to validate my hyperparameter choices by using up Kaggle submissions, but given what I've observed, I'm not clear that the results will be understandable or useful.
Can someone explain how I should be using Cross Validation to get meaningful results?
I think there is a conceptual problem here.
If you want to compute the error of a prediction, you should not use the training data. As the name says, that data is used only for training; to evaluate accuracy scores you have to use data the model has never seen.
As for cross-validation, it is an approach for evaluating the model on several different training/testing splits. The process is as follows: you divide your data into n groups and run several iterations, changing which group is used for testing each time. With n groups you do n iterations, and each time the training and testing sets are different.
Basically, what you should do is something like this:
Train the model using months 0 to 30 (for example).
Look at the predictions made with months 31 to 35 as input.
If the input has to be the same length, split the features in half (that should be about 17 months each).
I hope I understood correctly; otherwise, please comment.
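To make the cross_validate numbers comparable with a hand-computed RMSE, negate the scores and take the square root; for a time-series problem a TimeSeriesSplit keeps the folds chronological. A minimal sketch, assuming predictor, features, and targets are the objects from the question:

import numpy as np
from sklearn.model_selection import cross_validate, TimeSeriesSplit

# Chronological folds: each fold trains on earlier data and tests on later data.
results = cross_validate(predictor, features, targets,
                         cv=TimeSeriesSplit(n_splits=5),
                         scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-results["test_score"])   # scores are *negative* MSE
print(rmse_per_fold.mean(), rmse_per_fold.std())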

ResUNet Segmentation output is bad although precision and recall values are higher on training and validation

I recently implemented ResUNet for parasite segmentation on blood sample images. The model is described in this paper, https://arxiv.org/pdf/1711.10684.pdf, and here is the code: https://github.com/DuFanXin/deep_residual_unet/blob/master/res_unet.py. The segmentation output is a binary image. I trained the model with a weighted binary cross-entropy loss, giving more weight to the parasite class since there is a class imbalance in my images. The last output layer has a sigmoid activation.
I calculate precision, recall, and the Dice coefficient to verify how good the segmentation is during training. On training and validation I got good numerical results:
Training
dice_coeff: .6895, f2: 0.8611, precision: 0.6320, recall: 0.9563
Validation
val_dice_coeff: .6433, val_f2: 0.7752, val_precision: 0.6052, val_recall: 0.8499
However, when I try to visually inspect the segmentations of the validation set, my algorithm outputs all black. After analyzing the predictions returned by the model, almost all values are close to zero, so it cannot correctly differentiate between background and foreground. The problem is: why do my metrics show good numerical values when the segmentation output does not?
I mean, are the metrics not giving me good information? Why is the recall value so high even though the output is all black?
I trained for about 50 epochs, and my training curves show constant learning. Is this because of the vanishing gradient problem?
No, you do not have a vanishing gradient issue.
I am almost 100% sure that the problem is related to the way in which you test.
The numbers in your training/validation do not lie.
Ensure that you apply exactly the same preprocessing to your test dataset as is applied during training.
E.g.: if you use the rescale=1/255.0 parameter in Keras ImageDataGenerator(), ensure that when you load a test image you divide it by 255.0 before predicting on it.
Note that the above is just one example; the inconsistency in your train/test preprocessing may stem from other causes.
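A minimal sketch of what that consistency looks like at prediction time (the file path, the 256x256 input size, and the 0.5 threshold are placeholders; model is your trained ResUNet):

import numpy as np
from tensorflow.keras.preprocessing import image

# Training used ImageDataGenerator(rescale=1/255.0), so the test image must be
# rescaled the same way before calling predict().
img = image.load_img("blood_sample.png", target_size=(256, 256))  # placeholder path/size
x = image.img_to_array(img) / 255.0          # same rescaling as in training
x = np.expand_dims(x, axis=0)                # add the batch dimension
mask = model.predict(x)[0]                   # sigmoid output, values in [0, 1]
binary_mask = (mask > 0.5).astype("uint8")   # threshold to get the binary segmentation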

GridSearchCV: based on mean_test_score results, predict should perform much worse, but it does not

I am trying to evaluate the performance of a regressor by means of GridSearchCV. In my implementation cv is an int, so I'm applying the K-fold validation method. Looking at cv_results_['mean_test_score'],
the best mean score on the k-fold unseen data is around 0.7, while the train scores are much higher, like 0.999. This is very normal, and I'm ok with that.
Well, following the reasoning behind this concept, when I apply the best_estimator_ on the whole data set, I expect to see at least some part of the data predicted not perfectly, right? Instead, the numerical deviations between the predicted quantities and the real values are near zero for all datapoints. And this smells of overfitting.
I don't understand this, because if I remove a small part of the data and apply GridSearchCV to the remaining part, I find almost identical results to the above, but the best regressor applied to the held-out, totally unseen data predicts with much higher errors, like 10%, 30%, or 50%. That is what I expected, at least for some points, when fitting GridSearchCV on the whole set, based on the results of the k-fold test sets.
Now, I understand that this forces the predictor to see all datapoints, but the best estimator is the result of k fits, each of which never saw 1/k of the data. Since mean_test_score is the average of these k scores, I expected to see a bunch of predictions (depending on the cv value) with errors distributed around a mean error that justifies a 0.7 score.
The refit=True parameter of GridSearchCV (which is the default) makes the estimator with the best-found set of hyperparameters get refit on the full data passed to fit(). So if your training error is almost zero within the CV folds, you should expect it to be near zero when you evaluate best_estimator_ on that same full dataset as well.
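A minimal sketch of the evaluation pattern this implies, keeping a held-out test set outside the search (RandomForestRegressor and the tiny grid here are placeholders; X and y are your full feature matrix and targets):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid={"n_estimators": [100, 300]},  # placeholder grid
                      cv=5, scoring="r2", refit=True)
search.fit(X_train, y_train)

print(search.cv_results_["mean_test_score"].max())     # honest CV estimate (~0.7 in the question)
print(search.best_estimator_.score(X_train, y_train))  # near 1.0: the estimator was refit on X_train
print(search.best_estimator_.score(X_test, y_test))    # realistic score on truly unseen data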

Test error lower than training error

Would appreciate your input on this. I am constructing a regression model with the help of genetic programming.
If my RMSE on test data is (much) lower than my RMSE on training data for a 1:5 test-to-training split, should I be worried?
The test data is drawn randomly, without replacement, from a set of 24 data points. The model was built with a genetic programming technique, so the number of features, the modeling framework, etc. vary as I minimize the training RMSE regularized by the number of nodes in the GP tree.
Is the model underfitted? Or should I have minimized MSE instead of RMSE? (I thought it would make no difference, since MSE is non-negative and its minimum coincides with the minimum of RMSE, assuming the optimizer is good enough to find the minimum.)
Thanks.
So your model is trained on 20 of the 24 data points and tested on the 4 remaining ones?
To me it sounds like you need (much) more data, so you can have larger training and test sets. I'm not surprised that the test RMSE is so far from the training RMSE: with only 4 test points the estimate is very noisy, and a model trained on so little data cannot learn reliably. As a rule of thumb, for machine learning you can never have enough data. Is it possible to gather a larger dataset?
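As an illustration of why a 4-point test set is hard to trust (not something prescribed above, just a way to check): repeating the 20/4 split many times shows how much the test RMSE bounces around. A sketch, assuming X and y hold the 24 samples as NumPy arrays and build_gp_model() is a hypothetical stand-in for your GP fitting routine:

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

rmses = []
for train_idx, test_idx in ShuffleSplit(n_splits=50, test_size=4, random_state=0).split(X):
    model = build_gp_model(X[train_idx], y[train_idx])   # hypothetical: fit the GP tree on 20 points
    preds = model.predict(X[test_idx])
    rmses.append(np.sqrt(mean_squared_error(y[test_idx], preds)))

print(np.mean(rmses), np.std(rmses))   # a large spread means any single 4-point RMSE is unreliable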
