Speed up cross validation in SVM scoring - scikit-learn

I want to get the accuracy, average_precision, F1, precision, recall and roc_auc scores.
I realise that with the code below I get the average_precision; the problem is that running it takes about 20 minutes to show results. Is there a better way to get all of the scores above in a shorter amount of time?
clf_svm_2_scores_avg_precision = cross_val_score(clf_svm_2, np.array(x), data['link'], cv=5, scoring='average_precision')

Cross-validation is generally slow, because the training/validation process is repeated several times and the scores are then averaged. You can try adding the n_jobs argument, set to the number of CPU cores on your machine, so that the folds run in parallel.
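For reference, a minimal sketch of collecting all of the requested metrics in a single pass; the estimator and data below are placeholders (an SVC on a synthetic dataset) standing in for clf_svm_2, np.array(x) and data['link'] from the question:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
clf = SVC(kernel='rbf')  # SVC exposes decision_function, which roc_auc/average_precision can use

scoring = ['accuracy', 'average_precision', 'f1', 'precision', 'recall', 'roc_auc']

# One call computes every metric from the same 5-fold fits;
# n_jobs=-1 spreads the folds over all available CPU cores.
results = cross_validate(clf, X, y, cv=5, scoring=scoring, n_jobs=-1)
for metric in scoring:
    print(metric, results['test_' + metric].mean())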

Related

Fewer features, longer model training time

I use machine learning algorithms for malware analysis. When I vary the input features, I get strange training times. For example:
4 features (A, B, C, D): model training time is 3 seconds.
3 features (A, B, C): training time is 5 seconds.
2 features (A, B): training time is 8 seconds.
1 feature (A): training time is 4 seconds.
This happens with both MLP and Random Forest. In my opinion, training should be faster with fewer features, but the result is completely different.
With KNN, the results look like this:
Using 6, 5, 4 or 3 features (out of A, B, C, D, E, F), model testing time is about 1.1 seconds, almost the same.
2 features (A, B): model testing time is 3 seconds.
1 feature (A): model testing time is 5 seconds.
My dataset has 17K records and I use 10-fold cross-validation. The features are sorted by entropy: feature A has the highest entropy and feature F the lowest. I run the tests on Google Colab with sklearn. I tried several times on different days, and the trend is the same.
The dataset has 79 features in total; the effect only appears when I use a few of them.
Thanks to anyone who replies; I have no idea what causes it.
It does seem at first glance that having fewer features should result in lower training times. However, depending on which algorithm is used, this may not be the case. During training, an objective function (loss function) is minimized by the algorithm. In the case of the MLP neural network, if you change the features (especially depending on whether they are informative or not), you change the feature space (or "error surface") over which the optimization occurs; the minima of the function may become harder to find, resulting in more steps, and therefore longer training, before the convergence criteria are satisfied.
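As a rough illustration, here is a hedged sketch on synthetic data (not the asker's malware dataset) that times an MLPClassifier on shrinking feature subsets and also reports how many iterations the optimizer actually ran:

import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: 6 informative features, ordered so that slicing
# X[:, :k] keeps the "first" k of them, loosely mirroring the question.
X, y = make_classification(n_samples=17000, n_features=6, n_informative=6,
                           n_redundant=0, random_state=0)

for k in (6, 4, 3, 2, 1):
    clf = MLPClassifier(random_state=0)          # default convergence-based stopping
    start = time.perf_counter()
    clf.fit(X[:, :k], y)                         # train on the first k features only
    elapsed = time.perf_counter() - start
    print(f"{k} feature(s): {elapsed:.1f}s, iterations run: {clf.n_iter_}")

With fewer informative features the optimizer often needs more iterations to satisfy its stopping criterion, so the wall-clock time per fit can go up even though each iteration is cheaper.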

What do sklearn.cross_validation scores mean?

I am working on a time-series prediction problem using GradientBoostingRegressor, and I think I'm seeing significant overfitting, as evidenced by a significantly better RMSE for training than for prediction. In order to examine this, I'm trying to use sklearn.model_selection.cross_validate, but I'm having problems understanding the result.
First: I was calculating RMSE by fitting to all my training data, then "predicting" the training data outputs using the fitted model and comparing those with the training outputs (the same ones I used for fitting). The RMSE that I observe is of the same order of magnitude as the predicted values and, more importantly, it's in the same ballpark as the RMSE I get when I submit my predicted results to Kaggle (although the latter is lower, reflecting overfitting).
Second, I use the same training data, but apply sklearn.model_selection.cross_validate as follows:
cross_validate( predictor, features, targets, cv = 5, scoring = "neg_mean_squared_error" )
I figure the neg_mean_squared_error should be the square of my RMSE. Accounting for that, I still find that the error reported by cross_validate is one or two orders of magnitude smaller than the RMSE I was calculating as described above.
In addition, when I modify my GradientBoostingRegressor max_depth from 3 to 2, which I would expect reduces overfitting and thus should improve the CV error, I find that the opposite is the case.
I'm keenly interested to use Cross Validation so I don't have to validate my hyperparameter choices by using up Kaggle submissions, but given what I've observed, I'm not clear that the results will be understandable or useful.
Can someone explain how I should be using Cross Validation to get meaningful results?
I think there is a conceptual problem here.
If you want to compute the error of a prediction, you should not use the training data. As the name says, these data are used only for training; to evaluate accuracy scores you have to use data that the model has never seen.
About cross-validation: it is an approach for estimating how well the model generalizes by rotating which part of the data is held out. The process is as follows: you divide your data into n groups and iterate, changing which group is used for testing. With n groups you do n iterations, and each time the training and testing sets are different.
Basically, what you should do is something like this:
Train the model using months 0 to 30 (for example).
Look at the predictions made with months 31 to 35 as input.
If the input has to be the same length, split the features in half (that should be about 17 months).
I hope I understood correctly; otherwise, please comment.
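To make the scoring side concrete, here is a hedged sketch (synthetic data and a placeholder GradientBoostingRegressor, not the asker's pipeline) that converts neg_mean_squared_error back into RMSE and uses TimeSeriesSplit so that every validation fold comes strictly after its training fold:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_validate

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
model = GradientBoostingRegressor(max_depth=3, random_state=0)

cv = TimeSeriesSplit(n_splits=5)              # train on the past, validate on the future
res = cross_validate(model, X, y, cv=cv, scoring='neg_mean_squared_error')

# The scorer returns the *negative* MSE, so flip the sign before the square root.
rmse_per_fold = np.sqrt(-res['test_score'])
print("CV RMSE per fold:", rmse_per_fold)
print("mean CV RMSE:", rmse_per_fold.mean())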

Issue Validating RandomizedSearchCV Results

I start with a basic Logistic Regression, using all default hyper-parameters, and I get a score of 0.8855.
Next I run a RandomizedSearchCV to find the best hyper-parameters; according to the search, C=10 with Max_iterations=110 gives a score of 0.89.
I then run the logistic regression with these hyper-parameters, but I get a much better accuracy, 0.91!
Why am I not getting exactly the same number?
You will definitely not get the same accuracy when you run it again on your training set. When you do k-fold cross-validation to check the performance of a particular set of hyper-parameters, you divide the entire data into k sets, use k-1 of them for training, and validate on the one set left out. You repeat this process k times, each time holding out a different set of data for validation. Finally you compute the average over all k iterations and report that accuracy, which is what you got in random_result.best_score_.
After getting the best set of hyper-parameters, you fit the model on the entire training data (all k folds together), so some variation is expected, since the data has changed and you are now evaluating on the entire training set. What you observe is completely normal and the usual behaviour.
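A minimal sketch of the difference, with synthetic data and a hypothetical parameter distribution: best_score_ is an average over held-out folds, whereas scoring the refit model on the training set evaluates data it has already seen, so the second number is usually higher:

import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=500),
    param_distributions={'C': loguniform(1e-2, 1e2)},
    n_iter=20, cv=5, random_state=0)
search.fit(X, y)

print("mean cross-validated score:", search.best_score_)     # averaged over 5 held-out folds
print("refit model on full train set:", search.score(X, y))  # evaluated on data the model has seen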

Batch normalization setup train and test time

Recently I have read many articles about Keras batch normalization, which has been discussed a lot.
According to this website:
Set “training=False” of “tf.layers.batch_normalization” when training will get a better validation result
The answer said that:
If you turn on batch normalization with training = True, it will start to normalize the batches within themselves and collect a moving average of the mean and variance of each batch. Now here's the tricky part. The moving average is an exponential moving average, with a default momentum of 0.99 for tf.layers.batch_normalization(). The mean starts at 0, the variance at 1 again. But since each update is applied with a weight of (1 - momentum), it will only asymptotically reach the actual mean and variance at infinity. For example, in 100 steps it will reach about 63.4% of the real value, because 0.99^100 ≈ 0.366. If you have numerically large values, the difference can be enormous.
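The quoted fraction is easy to check; this tiny sketch just iterates the update new = momentum * old + (1 - momentum) * batch_statistic:

# Numeric check of the quoted claim: with momentum 0.99 the moving mean
# absorbs only 1 - 0.99**n of the true batch statistic after n steps.
momentum = 0.99
moving_mean, true_mean = 0.0, 10.0     # the moving average starts at 0

for step in range(100):
    moving_mean = momentum * moving_mean + (1 - momentum) * true_mean

print(moving_mean / true_mean)         # ~0.634, i.e. about 63.4% after 100 steps
print(1 - momentum ** 100)             # the same number in closed form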
Since my batch size is small, more steps are needed, and the difference between training and test statistics could be large, which leads to bad results when predicting.
So I have to set training=False in the call, which, again from the link above, means that:
When you set training = False that means the batch normalization layer will use its internally stored average of mean and variance to normalize the batch, not the batch's own mean and variance.
I know that during test time we should use the moving mean and moving variance from training time, and I know that the moving_mean_initializer can be set:
keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)
I am not sure if my understanding is correct:
(1) set training=False when testing and training=True when training;
(2) use history_weight = ModelCheckpoint(filepath="weights.{epoch:02d}.hdf5", save_weights_only=True, save_best_only=False) to store the normalization weights (including the moving mean and variance, and of course gamma and beta);
(3) initialize the layer with what we get from step (2).
If anything I mentioned above is wrong, please correct me.
I am also not sure how people typically deal with this problem. Would the approach I propose work?
Thanks in advance!
I did some tests. After training, I set every batch normalization layer's moving mean and moving variance to zero, and it gave bad results. So I believe that in inference mode Keras uses the moving mean and moving variance.
As for the training flag: whether you set it to True or False, the only difference between the two is whether the moving mean and moving variance get updated or not.
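A hedged tf.keras sketch of that behaviour (a toy model on random data, not the asker's network): the moving mean and variance are ordinary layer weights, so they are saved together with gamma and beta by save_weights() or ModelCheckpoint(save_weights_only=True), and they are the statistics used when the model is called with training=False:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.BatchNormalization(momentum=0.99),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

x = np.random.randn(256, 4).astype('float32')
y = np.random.randn(256, 1).astype('float32')
model.fit(x, y, epochs=2, batch_size=32, verbose=0)

# gamma, beta, moving_mean and moving_variance are all ordinary weights of the layer,
# so checkpointing the weights also checkpoints the running statistics.
bn = next(l for l in model.layers if isinstance(l, tf.keras.layers.BatchNormalization))
print([w.name for w in bn.weights])
print(bn.moving_mean.numpy()[:3], bn.moving_variance.numpy()[:3])

model.save_weights('bn.weights.h5')   # ModelCheckpoint(save_weights_only=True) stores the same thing

# Inference: training=False makes BatchNormalization use the stored moving
# statistics rather than the statistics of this particular batch.
pred = model(x[:8], training=False)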

GridSearchCV: based on mean_test_score results, predict should perform much worse, but it does not

I am trying to evaluate the performance of a regressor by means of GridSearchCV. In my implementation cv is an int, so I'm applying the K-fold validation method. Looking at cv_results_['mean_test_score'],
the best mean score on the k-fold unseen data is around 0.7, while the train scores are much higher, like 0.999. This is very normal, and I'm ok with that.
Well, following the reasoning behind this concept, when I apply the best_estimator_ on the whole data set, I expect to see at least some part of the data predicted not perfectly, right? Instead, the numerical deviations between the predicted quantities and the real values are near zero for all datapoints. And this smells of overfitting.
I don't understand that, because if I remove a small part of the data and apply GridSearchCV to the remaining part, I find almost identical results to the above, but the best regressor applied to the truly unseen part predicts with much higher errors, like 10%, 30% or 50%. That is what I expected, at least for some points, when fitting GridSearchCV on the whole set, given the k-fold test results.
Now, I understand that this forces the predictor to see all datapoints, but the best estimator is the result of k fits, each of which never saw 1/k of the data. Since mean_test_score is the average of these k scores, I expected to see a subset of predictions (depending on the cv value) whose errors are distributed around a mean error consistent with a 0.7 score.
The refit=True parameter of GridSearchCV makes the estimator with the best set of hyperparameters found be refit on the full data. So if your training error is almost zero within the CV folds, you should expect it to be near zero for best_estimator_ as well.
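A minimal sketch reproducing the effect, with placeholder data and a placeholder grid: mean_test_score comes from held-out folds, while scoring best_estimator_ on the full fitting set scores data it was refit on; only a genuinely held-out split gives a comparable estimate of generalization:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

gs = GridSearchCV(RandomForestRegressor(random_state=0),
                  {'max_depth': [None, 5]}, cv=5, refit=True)
gs.fit(X_fit, y_fit)

print("best mean_test_score (held-out folds):", gs.best_score_)
print("best_estimator_ on the data it was refit on:", gs.best_estimator_.score(X_fit, y_fit))
print("best_estimator_ on a truly held-out split:  ", gs.best_estimator_.score(X_hold, y_hold))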
