Does it make sense to use scikit-learn cross_val_predict() to (i) make predictions with unseen data in k-fold cross-validation and (ii) compare models? - scikit-learn

I'm training and evaluating a logistic regression model and an XGBoost classifier.
With the XGBoost classifier, a training/validation/test split of the data and the subsequent training and validation show that the model is overfitting the training data, so I'm using k-fold cross-validation to reduce the overfitting.
To work with k-fold cross-validation, I'm splitting my data into training and test sets and performing the k-fold cross-validation on the training set. The code looks something like the following:
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = XGBClassifier()
kfold = StratifiedKFold(n_splits=10)
results = cross_val_score(model, x_train, y_train, cv=kfold)
The code works. Now, I've read several forums and blogs on how to make predictions after a k-fold cross-validation, but after these readings, I'm still not sure about the proper way of doing the predictions.
It would seem that using the cross_val_predict() method from sklearn.model_selection and using the test set is OK. The code would look something like the following:
y_pred = cross_val_predict(model, x_test, y_test, cv = kfold)
The code works, but the issue is whether this makes sense since I've seen more complicated ways of doing so and where it doesn't seem clear whether the training or the test set should be used for the predictions.
And if this makes sense, computing the accuracy score and the confusion matrix would be as simple as running something like the following:
accuracy = metrics.accuracy_score(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
These two would help compare the logistic regression and the XGBoost classifier. Does this way of making predictions and evaluating models make sense?
Any help is appreciated! Thanks!

I want to answer this question I posted myself by summarizing things I have read and tried.
First, I want to clarify that the idea behind splitting my data into training/test sets and performing the k-fold cross-validation on the training set is to reserve the test set for estimating a generalization error, in much the same way we reserve the test set when splitting data into training/validation/test sets. For the sake of clarity, let me split the discussion into 2 sections.
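As a quick recap, the overall setup described above (a sketch with assumed variable names x and y for the full dataset) looks roughly like this:
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# hold out a test set first; cross-validate only on the training part
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=0)
kfold = StratifiedKFold(n_splits=10)
cv_scores = cross_val_score(XGBClassifier(), x_train, y_train, cv=kfold)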
Section 1
Now, after reading more, it's clearer to me that cross_val_predict() returns the predictions that were obtained during the cross-validation when the elements were in a test set (see section 3.1.1.2 in this scikit-learn cross-validation doc). This test set refers to one of the test sets the cross-validation procedure internally creates (cross-validation creates a test set in each fold). Thus:
y_pred = cross_val_predict(model, x_train, y_train, cv = kfold)
returns the predictions from the cross-validation internal test sets. It then seems safe to obtain the accuracy and confusion matrix with:
accuracy = metrics.accuracy_score(y_train, y_pred)
cm = metrics.confusion_matrix(y_train, y_pred)
Although cross_val_predict(model, x_test, y_test, cv = kfold) runs without error, doing this doesn't make much sense: it simply runs a new cross-validation within the test set (fitting the model on parts of the test set), rather than evaluating a model trained on the training set.
Section 2
From some blogs that talk about creating a confusion matrix after a cross-validation procedure (see here and here), I borrowed code that, for each fold of the cross-validation, extracts the labels and predictions from the internal test set. These labels and predictions are later used to compute the confusion matrix. Assuming I store the labels and predictions in variables called actual_classes and predicted_classes, respectively, I then run:
accuracy = metrics.accuracy_score(actual_classes, predicted_classes)
cm = metrics.confusion_matrix(actual_classes, predicted_classes)
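For reference, a minimal sketch of such a per-fold loop (my assumption of what the borrowed code roughly does; x_train and y_train are assumed to be NumPy arrays) could look like this:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

kfold = StratifiedKFold(n_splits=10)
actual_classes, predicted_classes = [], []
for train_idx, test_idx in kfold.split(x_train, y_train):
    # fit a fresh model on this fold's training part ...
    fold_model = XGBClassifier().fit(x_train[train_idx], y_train[train_idx])
    # ... and collect the labels and predictions from this fold's internal test part
    actual_classes.extend(y_train[test_idx])
    predicted_classes.extend(fold_model.predict(x_train[test_idx]))
actual_classes = np.array(actual_classes)
predicted_classes = np.array(predicted_classes)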
The results are exactly the same as the ones from Section 1's equivalent code. This reinforces that cross_val_predict(model, x_train, y_train, cv = kfold) works fine.
Thus:
1. Does it make sense to use scikit-learn cross_val_predict() to make predictions with unseen data in k-fold cross-validation? I would say no, it doesn't, since cross_val_predict() makes predictions with the internal test sets from the cross-validation procedure. To make predictions with unseen data and compute a generalization error, we would need a way to extract one of the models fitted during the cross-validation procedure (e.g., see this question and the sketch after this list).
2. Does it make sense to use scikit-learn cross_val_predict() to compare models? I would say yes, it does, as long as the method is executed as shown in Section 1. The resulting accuracy and confusion matrix can then be used for comparisons against other models.
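One possible way to get hold of such a model (a sketch using scikit-learn's cross_validate with return_estimator=True, not necessarily what the linked question proposes) would be:
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

# keep the estimator fitted in each fold so that one of them can later
# predict on truly unseen data (here, the held-out test set)
cv_results = cross_validate(XGBClassifier(), x_train, y_train, cv=kfold, return_estimator=True)
best_fold = cv_results['test_score'].argmax()  # e.g., pick the best-scoring fold
y_test_pred = cv_results['estimator'][best_fold].predict(x_test)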
Any comment is appreciated! Thanks!

Related

GridSearchCV, Data Leaks & Production Process Clarity

I've read a bit about integrating scaling with cross-validation and hyperparameter tuning without risking data leaks. The most sensible solution I've found (to my knowledge) involves creating a pipeline that includes the scaler, and passing that pipeline to GridSearchCV when you want to grid search and cross-validate. I've also read that, even when using cross-validation, it is useful to create a hold-out test set at the very beginning for an additional, final evaluation of your model after hyperparameter tuning. Putting that all together looks like this:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# train/test split on unscaled data to create a final hold-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# pipeline with scaler and model, so that in each fold the scaler is fitted
# on that fold's training set only and the corresponding training/validation
# sets are transformed with it, preventing data leaks between train and test
pipe = Pipeline([('sc', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# define hyperparameters to search (note the double underscore for pipeline steps)
params = {'knn__n_neighbors': [3, 5, 7, 11]}

# create the grid search
search = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      return_train_score=True)
search.fit(X_train, y_train)
Assuming my understanding and the above process is correct, my question is what's next?
My guess is that we:
fit our scaler to X_train
transform X_train and X_test with our scaler
train a new model using X_train and the best parameters found by the grid search
test the new model on our very first hold-out test set.
Presumably, because the grid search evaluated models with scaling based on various slices of the data, the difference in values from scaling our final, whole training and test data should be fine.
Finally, when it is time to process completely new data points through our production model, do those data points need to be transformed with the scaler fitted to our original X_train?
Thank you for any help. I hope I am not completely misunderstanding fundamental aspects of this process.
Bonus Question:
I've seen example code like the above from a number of sources. How does the pipeline know to fit the scaler to the fold's training data and then transform the training and test data? Usually we have to define that process:
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
GridSearchCV will help you find the best set of hyperparameters for your pipeline and dataset. To do that it uses cross-validation (splitting your training set into 5 equal subsets, in your case). This means that the best_estimator_ will have been trained on only 80% of the training set.
As you know, the more data a model sees, the better its results usually are. Therefore, once you have the optimal hyperparameters, it is wise to retrain the best estimator on your whole training set and assess its performance with the test set.
You can retrain the best estimator on the whole training set by specifying the parameter refit=True of GridSearchCV and then score the resulting best_estimator_ on the test set as follows:
search = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      return_train_score=True,
                      refit=True)
search.fit(X_train, y_train)

tuned_pipe = search.best_estimator_
tuned_pipe.score(X_test, y_test)
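As for the bonus question about how the pipeline knows what to fit and what to only transform: conceptually (this is only a rough sketch of the behaviour, not scikit-learn's actual internals, and it assumes X_train and y_train are NumPy arrays), each cross-validation split inside GridSearchCV does something like this:
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5)
for train_idx, val_idx in cv.split(X_train, y_train):
    fold_pipe = clone(pipe)
    # Pipeline.fit() fits the scaler on this training fold only, transforms
    # the fold, and then fits the classifier on the transformed data
    fold_pipe.fit(X_train[train_idx], y_train[train_idx])
    # Pipeline.score()/predict() only *transform* the validation fold with the
    # already-fitted scaler before passing it to the classifier
    fold_score = fold_pipe.score(X_train[val_idx], y_train[val_idx])
The same mechanism answers the production question: calling tuned_pipe.predict(new_data) applies the scaler that was fitted on the training data to new_data before handing it to the classifier, so you don't scale new data yourself.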

How to use .fit() with cross validation

I'm pretty new to data science and a bit confused.
I just want to ensure that my approach makes sense.
I create models like:
lr7 = GaussianNB().fit(X_train, y_train)
and use cross_val_predict() right after:
y_pred7 = cross_val_predict(lr7, X_test, y_test, cv=5, n_jobs=-1, verbose=5)
Wouldn't it make much more sense to cross-validate the training set first?
There is also a cross_validate() function in scikit-learn.
Is it correct to use that one with the training dataset? In the documentation they use X and y for both, not train/test split data.
A simple way to implement cross-validation is to use the cross_val_score function from sklearn.model_selection. This may be appropriate for your problem.
# build model
lr7 = GaussianNB()
scores = cross_val_score(lr7, X, y, cv=5)
Note that in cross-validation you use either the whole dataset or the training part (X_train, y_train), but never the test part, as you do in your code.
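Putting it together, a minimal sketch of the usual workflow (reusing the variable names from your question) is: cross-validate on the training data to assess the model, then fit once on the full training set and evaluate on the held-out test set.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

model = GaussianNB()
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # cross-validation on the training part only
model.fit(X_train, y_train)                                 # final fit on all the training data
test_score = model.score(X_test, y_test)                    # single, final generalization estimate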

Forecasting/prediction using ARIMA in python - how does it work?

I am very confused about how to predict/forecast using ARIMA.
Let's assume we have a series called y_orig that we split into y_train and y_test. Assuming that y_orig is not stationary, we could fit an ARIMA model using the code below:
# fit ARIMA model
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(y_train, order=(2,1,2))
model_fit = model.fit(disp=0)
print(model_fit.summary())
After fitting the model, we can predict using the code below
n_periods = len(y_test)
fc, _, _ = model_fit.forecast(n_periods, alpha=0.05)  # 95% conf
The value fc should give a forecast which I then compare to y_test. Please note that, as expected, y_test is not used in the training phase. Also note that I am not looking for a rolling forecast but for a long-term forecast where the parameters (once trained) are fixed.
I am very confused because y_test is not used at all in the forecasting phase.
For instance, if we were to use other prediction models (like in Keras or TensorFlow), we would code it this way.
First, we fit the model in the training phase, which I don't show; it does not matter for my question. Then we predict and see how good our in-sample fit is using the code below:
y_pred_train=model.predict(y_train)
Then we test the model out of sample as below:
y_pred_test=model.predict(y_test)
In this situation, the parameters are not re-estimated and y_test is used in the testing phase to forecast the next value (with fixed parameters).
Hence my confusion with ARIMA. Why do we not do the same with the ARIMA model?
Please help me understand, as I am very confused.
Thanks so much!!
I think you're a bit confused by the .fit call and the y_train variable in the ARIMA code block. y_train is just a poorly named variable here; it should really be y, the data you want to forecast. The ARIMA model has no training/test phase and is not self-learning: it performs a statistical analysis of the input data and produces a forecast. If you want to make another forecast (on y_test), you need to run another statistical analysis (using model.fit) and produce another forecast (using model_fit.forecast). The ARIMA model does not have weights that it learns in a training phase; nothing from the data it was 'fitted' on is saved in the model in a reusable way. You can't use a "fitted" ARIMA model to forecast other data samples.
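To make that concrete, here is a small sketch using the same (older) statsmodels ARIMA API as in the question: to forecast beyond the data the model was fitted on, you rerun the fit on all the data you now have and forecast from there. (y_orig and the order (2,1,2) are taken from the question; the 10-step horizon is an arbitrary choice.)
from statsmodels.tsa.arima_model import ARIMA

# refit on the full series and forecast the next periods ahead
model = ARIMA(y_orig, order=(2, 1, 2))
model_fit = model.fit(disp=0)
fc, _, _ = model_fit.forecast(steps=10, alpha=0.05)  # point forecasts for the next 10 periods, 95% confidence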

How to apply Monte Carlo cross validation to multiple linear regression in Python?

Hello everyone.
I am relatively new to Python and I am trying to predict a numeric variable based on 10 different numeric inputs. In particular, I am trying to apply multiple linear regression, but would like to add Monte Carlo cross-validation in the train-test-validation phase. So, I wrote code that looks like this:
#I have imported libraries
#imported the dataset
#then created X and Y df.
#then split the data into training and testing, with validation parameters as follows:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=np.random.randint(1000), test_size=0.3)
# I have used np.random.randint(1000) as a Monte Carlo cross validation.
The code used for regression is:
#Linear Regression Model
regressor = linear_model.LinearRegression()
regressor.fit(X_train, Y_train)
y_predLR = regressor.predict(X_test)
lin_mse = mean_squared_error(y_predLR, Y_test)
lin_rmse = np.sqrt(lin_mse)
My question is: is this the right way to apply Monte Carlo cross validation?
After this, I applied MLR, and with each run of the code the R-squared, MSE and other values change, so I am guessing the Monte Carlo part worked. If so, is there any way to get the same results with each run while still using MCCV?
Moreover, the goal is to also develop an ANN model (also with Monte Carlo), eventually compare MLR and ANN, and then make predictions for the future period using the better model. I read someplace that MCCV cannot be used when making predictions; is this right?
Many thanks for your time.
In order to apply MCCV, you should run the process of randomly generating (without replacement) the training and test sets multiple times.
So, roughly speaking, you need to put your code (generation of the training/test sets, learning, and prediction) inside a for loop, as in the sketch below.
Note that the partitions are generated independently for each run, so the same data point can appear in the test set (or training set) of several different runs; this is, in fact, the significant difference from k-fold cross-validation, where each point appears in a test set exactly once.
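A minimal sketch of such a loop (reusing X and Y from your question; the number of runs is an arbitrary choice) could be:
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

n_runs = 100
rmse_scores = []
for run in range(n_runs):
    # a different random split each run; using the run index as the seed
    # makes each split different but the whole experiment reproducible
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=run)
    regressor = linear_model.LinearRegression().fit(X_train, Y_train)
    rmse_scores.append(np.sqrt(mean_squared_error(Y_test, regressor.predict(X_test))))

print(np.mean(rmse_scores), np.std(rmse_scores))
scikit-learn's ShuffleSplit splitter, combined with cross_val_score, implements essentially the same repeated-random-split scheme if you prefer not to write the loop yourself.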

Scikit SVM gives very poor accuracy for STL-10 dataset

I am using scikit-learn's SVM to train a model on the STL-10 dataset, which contains 5000 training images (10 pre-defined folds). So I have a 5000*96*96*3 dataset for training and test purposes. I used the following code to train the model and measure the accuracy on the test set (an 80%/20% split). The final result was an accuracy of 0.323. How can I increase the accuracy of the SVM?
This is the STL-10 dataset.
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn import metrics

def train_and_evaluate(clf, train_x, train_y):
    clf.fit(train_x, train_y)

# make the images a 2D array, as fit() only accepts 2D input
nsamples, nx, ny, nz = images.shape
reshaped_train_dataset = images.reshape((nsamples, nx * ny * nz))
X_train, X_test, Y_train, Y_test = train_test_split(reshaped_train_dataset, read_labels(LABEL_PATH), test_size=0.20, random_state=33)

my_svc = SVC()  # assuming a plain SVC instance; the question's my_svc is not shown
train_and_evaluate(my_svc, X_train, Y_train)
print(metrics.accuracy_score(Y_test, my_svc.predict(X_test)))
So it seems you are using raw SVM directly on the images. That is usually not a good idea (it is rather bad actually).
I will describe the classic image-classification pipeline that was popular in the last decades! Keep in mind that the highest-performing approaches right now might use deep neural networks to combine some of these steps (a very different approach; a lot of research in recent years!)
First step:
Preprocessing is needed!
Normalize mean and variance (I would not expect your dataset to be already normalized)
Optional: histogram-equalization
Second step:
Feature-extraction -> you should learn some features from these images. There are a lot of approaches including
(Kernel-)PCA
(Kernel-)LDA
Dictionary-learning
Matrix-factorization
Local binary patterns
... (just test with LDA initially)
Third:
SVM for classification
again, there might be a normalization step needed before this and, as mentioned in the comments by @David Batista, there might be some parameter tuning needed (especially for kernel SVMs)
It is also not clear whether using color information is wise here. For simpler approaches I would expect black-and-white images to be superior (you lose information, but tuning your pipeline is more robust); high-performance approaches will of course use color information.
See here for a random tutorial describing a similar problem. While I don't know if it is good work, you can immediately recognize the processing pipeline mentioned above (preprocessing, feature extraction, classifier learning)!
Edit:
Why preprocessing?: some algorithms assume centered samples with unit variance, therefore normalization is needed. This is (at least) very important for PCA, LDA and SVMs.
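To tie these three steps together, a hypothetical scikit-learn sketch (the step choices and parameters below are placeholders picked for illustration, not tuned values, and X_train/Y_train/X_test/Y_test are assumed to be the reshaped arrays from the question) might look like this:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pipe = Pipeline([
    ('scale', StandardScaler()),                       # step 1: zero mean / unit variance
    ('features', PCA(n_components=100)),               # step 2: feature extraction
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale')),  # step 3: (kernel) SVM classifier
])
pipe.fit(X_train, Y_train)
print(pipe.score(X_test, Y_test))
Wrapping the steps in a single Pipeline also makes it straightforward to tune the SVM (and PCA) parameters with GridSearchCV later.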
