Issue Validating RandomizedSearchCV Results - scikit-learn

I start with a basic Logistic Regression, using all default hyper-parameters, and I get a score of 0.8855.
Next I run a RandomSearch to find the best hyper-parameters. According to the RandomSearch, C=10 with Max_iterations=110 gives a score of 0.89.
I run the logistic regression with these hyper-parameters but get a much better accuracy, 0.91!
Why am I not getting exactly the same number?

You will not get the same accuracy when you score the model again on your training set. When you run k-fold cross-validation to check the performance of a particular set of hyper-parameters, you divide the data into k sets, train on k-1 of them, and validate on the remaining one. You repeat this process k times, each time holding out a different set for validation. Finally you average the k validation scores; that average is what you got in random_result.best_score_. The figure below explains the process.
After finding the best set of hyper-parameters, you fit the model on the entire training data, i.e. set 1, set 2 and set 3. The score is bound to vary somewhat, since the training data has changed and you are now evaluating on the entire training data, which the model has already seen. So what you observe is normal, expected behavior.
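As a rough sketch of the difference described above, the snippet below runs a small RandomizedSearchCV and compares best_score_ (the mean over held-out folds) with the accuracy of the refit model on the full training data. The synthetic dataset, the parameter lists, and cv=3 are assumptions chosen for illustration, not the asker's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the asker's data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions={"C": [0.01, 0.1, 1, 10, 100], "max_iter": [100, 110, 200, 500]},
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

# best_score_ is the mean accuracy over the held-out folds.
best = search.best_index_
fold_scores = [search.cv_results_[f"split{i}_test_score"][best] for i in range(3)]
cv_mean = sum(fold_scores) / len(fold_scores)

# best_estimator_ was refit on all of X, so this is a training-set accuracy.
train_accuracy = search.best_estimator_.score(X, y)

print("per-fold scores:", fold_scores)
print("best_score_    :", search.best_score_)
print("train accuracy :", train_accuracy)
```

The two numbers answer different questions, so there is no reason to expect them to match exactly.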

Related

Speed up cross validation in SVM scoring

I want to get the accuracy, average_precision, F1, precision, recall and roc_auc scores.
I realise that the code below gives me the average_precision. The problem is that it takes about 20 minutes to show the results. Is there a better way to get all the scores above in a shorter amount of time?
clf_svm_2_scores_avg_precision = cross_val_score(clf_svm_2, np.array(x), data['link'], cv=5, scoring='average_precision')
Cross-validation is generally slow, because the training/validation process is repeated several times and the scores are then averaged. You can try adding the n_jobs argument, which sets the number of jobs to run in parallel; set it to the number of cores on your computer, or to -1 to use all of them.
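One possible way to combine this with the list of metrics from the question is cross_validate, which accepts several scorers and fits each fold only once. This is a minimal sketch: the synthetic data and the default SVC() are assumptions standing in for x, data['link'] and clf_svm_2.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic binary data standing in for x and data['link'] (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

metrics = ["accuracy", "average_precision", "f1", "precision", "recall", "roc_auc"]

# One cross-validation pass scores every metric; n_jobs=-1 uses all cores.
results = cross_validate(SVC(), X, y, cv=5, scoring=metrics, n_jobs=-1)

mean_scores = {m: results[f"test_{m}"].mean() for m in metrics}
for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```

Calling cross_val_score once per metric would refit the model five times for each metric, so a single multi-metric pass should save most of that work.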

What do sklearn.cross_validation scores mean?

I am working on a time-series prediction problem using GradientBoostingRegressor, and I think I'm seeing significant overfitting, as evidenced by a significantly better RMSE for training than for prediction. In order to examine this, I'm trying to use sklearn.model_selection.cross_validate, but I'm having problems understanding the result.
First: I was calculating RMSE by fitting to all my training data, then "predicting" the training data outputs using the fitted model and comparing those with the training outputs (the same ones I used for fitting). The RMSE that I observe is the same order of magnitude as the predicted values and, more important, it's in the same ballpark as the RMSE I get when I submit my predicted results to Kaggle (although the latter is lower, reflecting overfitting).
Second, I use the same training data, but apply sklearn.model_selection.cross_validate as follows:
cross_validate( predictor, features, targets, cv = 5, scoring = "neg_mean_squared_error" )
I figure the neg_mean_squared_error should be the square of my RMSE. Accounting for that, I still find that the error reported by cross_validate is one or two orders of magnitude smaller than the RMSE I was calculating as described above.
In addition, when I modify my GradientBoostingRegressor max_depth from 3 to 2, which I would expect reduces overfitting and thus should improve the CV error, I find that the opposite is the case.
I'm keenly interested to use Cross Validation so I don't have to validate my hyperparameter choices by using up Kaggle submissions, but given what I've observed, I'm not clear that the results will be understandable or useful.
Can someone explain how I should be using Cross Validation to get meaningful results?
I think there is a conceptual problem here.
If you want to compute the error of a prediction, you should not use the training data. As the name says, this data is used only for training; to evaluate accuracy you have to use data that the model has never seen.
Cross-validation is an approach for evaluating a model on several different training/testing splits. The process is as follows: you divide your data into n groups and iterate, changing the testing group on each pass. With n groups you do n iterations, and each time the training and testing sets are different. It's more understandable in the image below.
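To make the rotation of the testing group concrete, here is a small sketch using scikit-learn's KFold on six placeholder samples. The sample count and n_splits=3 are arbitrary choices for illustration.

```python
from sklearn.model_selection import KFold

samples = list(range(6))          # six placeholder samples
kfold = KFold(n_splits=3)         # three groups, no shuffling

test_groups = []
for train_idx, test_idx in kfold.split(samples):
    # Each iteration holds out a different group for testing.
    test_groups.append(test_idx.tolist())
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

Every sample ends up in the testing group exactly once, and the reported score is the average over the iterations.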
Basically, what you should do is something like this:
Train the model using months from 0 to 30 (for example)
See the predictions made with months from 31 to 35 as input.
If the input has to be the same length, divide the features in half (should be 17 months).
I hope I understood correctly; otherwise, comment.
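The train-on-earlier-months, predict-later-months idea can be sketched with TimeSeriesSplit, which is one way (not necessarily the answerer's) to get such splits from cross_validate. The random-walk series and the single lagged feature are hypothetical. The sketch also shows that neg_mean_squared_error is the negated MSE, so an RMSE is obtained by flipping the sign and taking the square root.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_validate

# Hypothetical time-ordered data: 36 "months" with one lagged feature.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=37))
features = series[:-1].reshape(-1, 1)   # value at month t
targets = series[1:]                    # value at month t + 1

# Each split trains on earlier months and validates on the months that follow.
cv = TimeSeriesSplit(n_splits=5)
results = cross_validate(
    GradientBoostingRegressor(max_depth=3, random_state=0),
    features, targets, cv=cv,
    scoring="neg_mean_squared_error",
)

# The scorer returns the negated MSE, so flip the sign before the square root.
rmse_per_split = np.sqrt(-results["test_score"])
print("RMSE per split:", rmse_per_split)
print("mean RMSE     :", rmse_per_split.mean())
```

With a plain cv=5 the folds are not ordered in time, so a model can be validated on months that precede its training data; that may be part of why the reported error looked too optimistic.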

How to select the best fit machine learning algorithm

When I run a machine learning algorithm repeatedly, the accuracy keeps changing. In that case, how do I select the best-fit algorithm for that particular data set?
You should definitely provide more details. It's impossible to suggest anything without knowing the domain, the model architecture, and the hyperparameters.
I guess you are asking because the model's accuracy changes between runs. I think you should set seeds for the random number generators so that the accuracy doesn't change between training runs and you can reproduce your results.
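import random
import numpy
import tensorflow as tf  # only if using tensorflow
numpy.random.seed(1)
random.seed(1)
tf.random.set_random_seed(1)  # TensorFlow 1.x; in TensorFlow 2.x use tf.random.set_seed(1)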
Let's assume the question is about the same training data set X, with accuracy computed each time by comparing the predicted responses against the dependent values (Y) of the test data.
If the accuracy keeps changing every time we run the model, the issue seems to be sampling bias (the split into training and test data differs from run to run).
When you use the train_test_split function, set the random_state parameter so that the split is reproducible, and keep the test data representative of the overall population of data.
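A minimal sketch of that advice: fixing random_state makes the split repeatable, and stratify=y (an addition not mentioned in the answer) is one way to keep the class proportions of the test set close to those of the whole data set. The imbalanced synthetic data is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 80% negatives, 20% positives.
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# Same random_state -> same split on every run; stratify keeps class ratios.
split_a = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

y_test_a, y_test_b = split_a[3], split_b[3]
print("identical test labels:", (y_test_a == y_test_b).all())
print("positive rate overall:", y.mean(), "in test:", y_test_a.mean())
```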

Batch normalization setup train and test time

Recently, I have read many articles discussing Keras batch normalization.
According to this website:
Set “training=False” of “tf.layers.batch_normalization” when training will get a better validation result
The answer said that:
If you turn on batch normalization with training = True that will start to normalize the batches within themselves and collect a moving average of the mean and variance of each batch. Now here's the tricky part. The moving average is an exponential moving average, with a default momentum of 0.99 for tf.layers.batch_normalization(). The mean starts at 0, the variance at 1 again. But since each update is applied with a weight of ( 1 - momentum ), it will asymptotically reach the actual mean and variance in infinity. For example in 100 steps it will reach about 63.4% of the real value, because 0.99^100 is 0.366. If you have numerically large values, the difference can be enormous.
Since my batch size is small, more steps are needed, and the difference between training and test could be big, which leads to bad results when predicting.
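The 100-step figure can be checked with a few lines of plain Python. Starting from 0 with momentum 0.99 and a constant batch mean, the moving mean reaches a fraction 1 - 0.99^n of the true value after n updates, so after 100 steps that is 1 - 0.366, or about 0.634. The value of true_mean below is an arbitrary illustration; the fraction does not depend on it.

```python
momentum = 0.99
true_mean = 50.0      # hypothetical per-batch mean, chosen for illustration
moving_mean = 0.0     # the moving mean is initialised at 0

for step in range(100):
    # Each update moves the estimate by a fraction (1 - momentum) of the gap.
    moving_mean = momentum * moving_mean + (1 - momentum) * true_mean

fraction_reached = moving_mean / true_mean
print(round(momentum ** 100, 3))      # 0.366
print(round(fraction_reached, 3))     # 0.634
```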
So I have to set training=False in the call, about which the link above says:
When you set training = False that means the batch normalization layer will use its internally stored average of mean and variance to normalize the batch, not the batch's own mean and variance.
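The two quoted descriptions can be sketched as a toy layer. This is a simplified illustration of the behaviour they describe, not the Keras or TensorFlow implementation: TinyBatchNorm is a hypothetical class, it handles a single feature, and it omits the learned gamma and beta.

```python
import numpy as np

class TinyBatchNorm:
    """Simplified 1-D batch normalisation (no learned gamma/beta)."""

    def __init__(self, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = 0.0       # same starting values as the initialisers
        self.moving_variance = 1.0

    def __call__(self, batch, training):
        if training:
            mean, variance = batch.mean(), batch.var()
            # Exponential moving averages, updated only in training mode.
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_variance = (
                self.momentum * self.moving_variance + (1 - self.momentum) * variance
            )
        else:
            mean, variance = self.moving_mean, self.moving_variance
        return (batch - mean) / np.sqrt(variance + self.epsilon)

layer = TinyBatchNorm()
batch = np.array([98.0, 100.0, 102.0])

train_out = layer(batch, training=True)    # normalised with the batch's own statistics
infer_out = layer(batch, training=False)   # normalised with the stored moving averages
print(train_out)
print(infer_out)
print(layer.moving_mean)
```

After a single training step the stored mean has only moved 1% of the way toward 100, so the inference-mode output is far from zero-centred. That is the mismatch the quoted answer warns about when few updates have been applied.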
And I know that during test time we should use the moving mean and moving variance from training time. I also know that moving_mean_initializer can be set.
keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)
I am not sure whether my understanding is correct:
(1) Set training=False when testing and training=True when training.
(2) Use hsitory_weight = ModelCheckpoint(filepath="weights.{epoch:02d}.hdf5",save_weights_only=True,save_best_only=False) to store the normalization weights (including the moving average and variance, and of course gamma and beta).
(3) Initialize the layer with what we get from step (2).
If anything I mentioned above is wrong, please do correct me.
How do people typically deal with this problem? Would the approach I propose work?
Thanks in advance!
I did a test: after training, I set the moving mean and moving variance of all batch normalization layers to zero, and it gave bad results.
So I believe that in inference mode Keras uses the moving mean and moving variance.
As for the training flag, whether you set it to True or False, the only difference between the two is whether the moving variance and moving mean are updated or not.

GridSearchCV: based on mean_test_score results, predict should perform much worse, but it does not

I am trying to evaluate the performance of a regressor by means of GridSearchCV. In my implementation cv is an int, so I'm applying the K-fold validation method. Looking at cv_results_['mean_test_score'], the best mean score on the k-fold unseen data is around 0.7, while the train scores are much higher, like 0.999. This is very normal, and I'm ok with that.
Well, following the reasoning behind this concept, when I apply the best_estimator_ on the whole data set, I expect to see at least some part of the data predicted not perfectly, right? Instead, the numerical deviations between the predicted quantities and the real values are near zero for all datapoints. And this smells of overfitting.
I don't understand that, because if I remove a small part of the data and apply GridSearchCV to the remaining part, I find almost identical results as above, but the best regressor applied to the totally unseen data predicts with much higher errors, like 10%, 30% or 50%. This is what I expected to see, at least for some points, when fitting GridSearchCV on the whole set, based on the results of the k-fold test sets.
Now, I understand that this forces the predictor to see all datapoints, but the best estimator is the result of k fits, each of which never saw a 1/k fraction of the data. Since mean_test_score is the average of these k scores, I expect to see a bunch of predictions (depending on the cv value) which show errors distributed around a mean error that justifies a 0.7 score.
The refit=True parameter of GridSearchCV makes the estimator with the found best set of hyperparameters be refit on the full data. So if your training error is almost zero in the CV folds, you would expect it to be near zero in the best_estimator_ as well.
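A small sketch of this effect, under assumed conditions (noisy synthetic data and a decision tree, chosen only because it fits its training data almost perfectly): the mean held-out score is modest, while the refit best_estimator_ scores near 1.0 on the data it was trained on.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data (an assumption for illustration).
X, y = make_regression(n_samples=300, n_features=10, noise=30.0, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"min_samples_leaf": [1, 2]},
    cv=5,
    refit=True,               # the default: refit the best setting on all of X
    return_train_score=True,
)
search.fit(X, y)

best = search.best_index_
cv_test_r2 = search.cv_results_["mean_test_score"][best]
cv_train_r2 = search.cv_results_["mean_train_score"][best]
full_data_r2 = search.best_estimator_.score(X, y)

print("mean held-out R^2     :", cv_test_r2)
print("mean in-fold train R^2:", cv_train_r2)
print("refit model on all X  :", full_data_r2)
```

The refit model has seen every point in X, so predicting X with it is comparable to mean_train_score, not to mean_test_score.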
