I'm pretty new to data science and a bit confused, and I just want to make sure my approach makes sense.
I create models like:
lr7 = GaussianNB().fit(X_train, y_train)
and then use cross_val_predict() right after:
y_pred7 = cross_val_predict(lr7, X_test, y_test, cv=5, n_jobs=-1, verbose=5)
Wouldn't it make much more sense to cross validate the train set first?
There is also a cross_validate() function in scikit-learn.
Is it correct to use this one with the training dataset? In the documentation they use the whole X and y, not train/test-split data.
A simple way to implement cross-validation is to use the cross_val_score function from sklearn; this may be appropriate for your problem.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# build the model and cross-validate it
lr7 = GaussianNB()
scores = cross_val_score(lr7, X, y, cv=5)
Note that in cross-validation you use either the whole dataset or only the training part (X_train, y_train), but never the test part as you do in your code.
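A minimal sketch of that workflow, with cross-validation on the training part and a single final check on the held-out test set (assuming X and y are your full feature matrix and labels; the names are illustrative):

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# hold out a test set first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# cross-validate on the training part only
model = GaussianNB()
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# once you are happy with the model, fit it on the full training set
# and evaluate it once on the held-out test set
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))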
Related
I'm training and evaluating a logistic regression and an XGBoost classifier.
With the XGBoost classifier, a training/validation/test split of the data and the subsequent training and validation shows the model is overfitting the training data. So, I'm working with k-fold cross-validation to reduce overfitting.
To work with k-fold cross-validation, I'm splitting my data into training and test sets and performing the k-fold cross-validation on the training set. The code looks something like the following:
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = XGBClassifier()
kfold = StratifiedKFold(n_splits=10)
results = cross_val_score(model, x_train, y_train, cv=kfold)
The code works. Now, I've read several forums and blogs on how to make predictions after a k-fold cross-validation, but after these readings, I'm still not sure about the proper way of doing the predictions.
It would seem that using the cross_val_predict() method from sklearn.model_selection and using the test set is OK. The code would look something like the following:
y_pred = cross_val_predict(model, x_test, y_test, cv = kfold)
The code works, but the issue is whether this makes sense, since I've seen more complicated ways of doing it where it isn't clear whether the training or the test set should be used for the predictions.
And if this makes sense, computing the accuracy score and the confusion matrix would be as simple as running something like the following:
accuracy = metrics.accuracy_score(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
These two would help compare the logistic regression and the XGBoost classifier. Does this way of making predictions and evaluating models make sense?
Any help is appreciated! Thanks!
I want to answer this question I posted myself by summarizing things I have read and tried.
First, I want to clarify that the idea behind splitting my data into training/test sets and performing the k-fold cross-validation on the training set is to reserve the test set for providing a generalization error in much the same way we split data into training/validation/test sets and use the test set for providing a generalization error. For the sake of clarity, let me split the discussion into 2 sections.
Section 1
Now, after reading more, it's clearer to me that cross_val_predict() returns the predictions that were obtained during the cross-validation when the elements were in a test set (see section 3.1.1.2 in this scikit-learn cross-validation doc). This test set refers to one of the test sets the cross-validation procedure internally creates (cross-validation creates a test set in each fold). Thus:
y_pred = cross_val_predict(model, x_train, y_train, cv = kfold)
returns the predictions from the cross-validation internal test sets. It then seems safe to obtain the accuracy and confusion matrix with:
accuracy = metrics.accuracy_score(y_train, y_pred)
cm = metrics.confusion_matrix(y_train, y_pred)
While cross_val_predict(model, x_test, y_test, cv = kfold) runs without errors, doing this doesn't seem to make much sense.
Section 2
From some blogs that talk about creating a confusion matrix after a cross-validation procedure (see here and here), I borrowed code that, for each fold of the cross-validation, extracts the labels and predictions from the internal test set. These labels and predictions are later used to compute the confusion matrix. Assuming I store the labels and predictions in variables called actual_classes and predicted_classes, respectively, I then run:
accuracy = metrics.accuracy_score(actual_classes, predicted_classes)
cm = metrics.confusion_matrix(actual_classes, predicted_classes)
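For reference, a rough sketch of the kind of per-fold loop those blogs describe, which produces actual_classes and predicted_classes (assuming x_train and y_train are NumPy arrays and model/kfold are defined as above; the loop is my reconstruction, not their exact code):

import numpy as np
from sklearn.base import clone

actual_classes, predicted_classes = [], []

for train_idx, test_idx in kfold.split(x_train, y_train):
    # fit a fresh copy of the model on this fold's training part
    fold_model = clone(model)
    fold_model.fit(x_train[train_idx], y_train[train_idx])
    # collect the true labels and predictions of this fold's internal test part
    actual_classes.extend(y_train[test_idx])
    predicted_classes.extend(fold_model.predict(x_train[test_idx]))

actual_classes = np.array(actual_classes)
predicted_classes = np.array(predicted_classes)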
The results are exactly the same as the ones from Section 1's equivalent code. This reinforces that cross_val_predict(model, x_train, y_train, cv = kfold) works fine.
Thus:
Does it make sense to use scikit-learn cross_val_predict() to make predictions with unseen data in k-fold cross-validation? I would say no, it doesn't, since cross_val_predict() makes predictions with the internal test sets from the cross-validation procedure. It seems that to make predictions with unseen data and compute a generalization error we would need a way to extract one of the models from the cross-validation procedure (e.g., see this question and the sketch below).
Does it make sense to use scikit-learn cross_val_predict() to compare models? I would say yes, it does, as long as the method is executed as shown in Section 1. The accuracy and confusion matrix could be used to make comparisons against other models.
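On the first point, one possible sketch of extracting the fitted per-fold models is cross_validate() with return_estimator=True (assuming model, kfold, x_train/y_train and the held-out x_test are as above):

from sklearn.model_selection import cross_validate

# keep the estimator fitted in each fold
cv_results = cross_validate(model, x_train, y_train, cv=kfold, return_estimator=True)

# e.g. take the model fitted in the first fold (or the best-scoring fold)
# and use it to predict on the truly unseen, held-out test set
fold_model = cv_results["estimator"][0]
y_test_pred = fold_model.predict(x_test)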
Any comment is appreciated! Thanks!
When fitting my data in Python I usually do:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
It splits my data into two chunks: one for training, the other for testing.
After that I fit my data with:
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
And I can get the accuracy with:
accuracy_score(y_test,y_pred)
I understand these steps.
But what is happening in sklearn.model_selection.cross_val_score? For example:
cross_val_score(estimator=model, X=X_train, y=y_train, cv=10)
Is it doing everything that I did before, but 10 times?
Do I have to split the data into train/test sets? From my understanding it splits the data, fits the model, predicts on the test fold and gets the accuracy score, 10 times, in one line.
But I don't see how large the train and test sets are. Can I set that manually? Also, are they the same size in each run?
The function "train_test_split" splits the train and test set randomly with a split ratio.
While the following "cross_val_score" function does 10-Fold cross-validation.
cross_val_score(estimator=model, X=X_train, y=y_train, cv=10)
In this case, the main difference is that 10-fold CV does not shuffle the data by default, so the folds follow the original order of the data. Think critically about whether the ordering of your data matters for cross-validation; this depends on your specific application.
Choosing which validation method to use: https://stats.stackexchange.com/questions/103459/how-do-i-know-which-method-of-cross-validation-is-best
You can read the docs about K-Fold here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold
Based on my understanding, if you set cv=10, it will divide your dataset into 10 folds. So if you have 1000 rows of data, in each fold 900 rows are used for training and the remaining 100 for testing. Hence, you are not required to set any test_size as you did in train_test_split.
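With cv=10 and n samples, each test fold therefore holds roughly n/10 samples (the fold sizes differ by at most one). If you want to control shuffling or the number of folds yourself, you can pass a CV splitter object instead of the integer; a minimal sketch (variable names as in the question):

from sklearn.model_selection import KFold, cross_val_score

# 10 folds, shuffled once before splitting; each fold uses ~90% of the
# rows for training and the remaining ~10% for testing
kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=kf)
print(scores.mean(), scores.std())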
This is a pretty straightforward question, to which I don't think I can add much more than the question itself: how can I combine a pipeline with cross_val_score for a multiclass problem?
I was working with a multiclass problem at work (this is why I won't share any data, but you can think of something like the iris dataset), where I needed to classify some texts according to their topic. This is what I was doing:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

pipe = Pipeline(
    steps=[
        ("vect", CountVectorizer()),
        ("feature_selection", SelectKBest(chi2, k=10)),
        ("reg", RandomForestClassifier()),
    ])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
However, I'm a little worried about overfitting (even though I'm evaluating with the test set) and I wanted to make a more rigorous analysis and add cross-validation. The problem is that I don't know how to use cross_val_score with the pipeline, nor how to evaluate a multiclass problem with cross-validation. I saw this answer, so I added this to my script:
cv = KFold(n_splits=5)
scores = cross_val_score(pipe, X_train, y_train, cv = cv)
The problem is that this only reports accuracy, which is not very informative for a multiclass classification problem.
Are there any alternatives? Is it possible to do cross-validation and not get only the accuracy? Or should I stick to accuracy, and is this not really a problem for some reason?
I know the question is rather 'broad', and it's actually not only about cross-validation; I hope this is not an issue.
Thanks in advance
It is almost always advisable to use cross validation to choose your model/hyperparameters, then to use an independent hold out test set to evaluate the performance of the model.
The good news is that you can do exactly what you wish to do, all within scikit-learn! Something like this:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline(
    steps=[
        ("vect", CountVectorizer()),
        ("feature_selection", SelectKBest(chi2, k=10)),
        ("reg", RandomForestClassifier())])

# Parameters of pipeline steps can be set using '__' separated parameter names:
param_grid = {
    'feature_selection__k': np.linspace(4, 16, 4).astype(int),  # number of features kept by SelectKBest
    'reg__n_estimators': [10, 30, 50, 100, 200],                # n_estimators in RandomForestClassifier
    'reg__min_samples_leaf': [2, 5, 10, 50]                     # min_samples_leaf in RandomForestClassifier
}

# This defines the grid search with "Area Under the ROC Curve" as the scoring metric.
# The plain 'roc_auc' scorer only supports binary targets; for a multiclass problem
# use a multiclass variant such as 'roc_auc_ovr'.
# More options here: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
search = GridSearchCV(pipe, param_grid, scoring='roc_auc_ovr')
search.fit(X_train, y_train)

print("Best parameter (CV score={:.3f}):".format(search.best_score_))
print(search.best_params_)
See here for even more details.
And if you want to define your own scoring metric for multi-class problems rather than using AUC or some other built-in scoring metric, see the documentation under the scoring parameter on this page, but that's all I can recommend without knowing what metric you're trying to optimize.
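If the goal is simply to get cross-validated metrics other than accuracy (rather than a full grid search), one alternative is cross_validate with one or more multiclass-friendly scorers such as macro-averaged F1; a minimal sketch using the pipe defined above:

from sklearn.model_selection import KFold, cross_validate

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    pipe, X_train, y_train, cv=cv,
    scoring=["accuracy", "f1_macro", "precision_macro", "recall_macro"],
)
# each metric is reported per fold under a "test_<metric>" key
print(scores["test_f1_macro"].mean())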
I've read a bit about integrating scaling with cross-validation and hyperparameter tuning without risking data leaks. The most sensible solution I've found (to my knowledge) involves creating a pipeline that includes the scaler and passing it to GridSearchCV, for when you want to grid search and cross-validate. I've also read that, even when using cross-validation, it is useful to create a hold-out test set at the very beginning for an additional, final evaluation of your model after hyperparameter tuning. Putting that all together looks like this:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# train/test split on unscaled data to create a final hold-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# instantiate a pipeline with the scaler and the model, so that in each fold
# the scaler is fit on that fold's training set only and then used to transform
# that fold's training and test sets, preventing data leaks between them
pipe = Pipeline([('sc', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# define hyperparameters to search (note the double underscore in 'knn__n_neighbors')
params = {'knn__n_neighbors': [3, 5, 7, 11]}

# create the grid search
search = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      return_train_score=True)
search.fit(X_train, y_train)
Assuming my understanding and the above process is correct, my question is what's next?
My guess is that we:
fit our scaler on X_train
transform X_train and X_test with the fitted scaler
train a new model using the scaled X_train and the best parameters found by the grid search
test the new model on our very first hold-out test set.
Presumably, because GridSearchCV evaluated models with scaling based on various slices of the data, the small differences that come from scaling our final, whole train and test data should be fine.
Finally, when it is time to run completely new data points through our production model, do those data points need to be transformed with the scaler fit to our original X_train?
Thank you for any help. I hope I am not completely misunderstanding fundamental aspects of this process.
Bonus Question:
I've seen example code like the above from a number of sources. How does the pipeline know to fit the scaler to the cross-validation fold's training data and then transform both the training and test data? Usually we have to define that process ourselves:
from sklearn.preprocessing import MinMaxScaler

# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
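For comparison, this is roughly what the pipeline does for you inside each cross-validation fold (a conceptual sketch, not scikit-learn's actual internals; X_fold_train, X_fold_test and y_fold_train stand for one fold's split and are my own names):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Pipeline.fit: each transformer is fit on the fold's training data only,
# and the transformed training data is passed to the next step
sc = StandardScaler().fit(X_fold_train)
knn = KNeighborsClassifier().fit(sc.transform(X_fold_train), y_fold_train)

# Pipeline.predict/score: the already-fitted transformers are only applied
# (transform, never re-fit) before the final estimator predicts
y_fold_pred = knn.predict(sc.transform(X_fold_test))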
GridSearchCV will help you find the best set of hyperparameters for your pipeline and dataset. To do that it uses cross-validation (splitting your training set into 5 equal subsets in your case), which means that during the search each candidate model is trained on only 80% of the training set at a time.
As you know, the more data a model sees, the better its results tend to be. Therefore, once you have the optimal hyperparameters, it is wise to retrain the best estimator on your whole training set and assess its performance on the test set.
You can have the best estimator retrained on the whole training set by leaving refit=True in GridSearchCV (it is the default) and then score the resulting best_estimator_ as follows:
search = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      return_train_score=True,
                      refit=True)
search.fit(X_train, y_train)

# best_estimator_ is the pipeline refit on the whole training set
# with the best hyperparameters found by the grid search
tuned_pipe = search.best_estimator_
tuned_pipe.score(X_test, y_test)
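As for the question above about completely new data points: because the scaler is one of the pipeline's steps, the fitted pipeline applies it automatically; a minimal sketch (new_X is a hypothetical array of raw, unscaled new samples):

# the pipeline scales new_X with the scaler fit during the refit on X_train,
# then predicts with the tuned KNeighborsClassifier
new_predictions = tuned_pipe.predict(new_X)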
Hello everyone.
So, I am relatively new to Python and I am trying to predict a numeric variable based on 10 different numeric inputs. In particular, I am trying to apply multiple linear regression, but would like to add Monte Carlo cross-validation in the train-test-validation phase. So, I wrote a code that looks like this:
#I have imported libraries
#imported the dataset
#then created X and Y df.
#then split the data into training and testing, with validation parameters as follows:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=np.random.randint(1000), test_size=0.3)
# I have used np.random.randint(1000) as a Monte Carlo cross validation.
The code used for regression is:
#Linear Regression Model
regressor = linear_model.LinearRegression()
regressor.fit(X_train, Y_train)
y_predLR = regressor.predict(X_test)
lin_mse = mean_squared_error(y_predLR, Y_test)
lin_rmse = np.sqrt(lin_mse)
My question is: is this the right way to apply Monte Carlo cross validation?
After this, I applied MLR, and with each run of the code, the R squared, MSE and other values change, so I am guessing the Monte Carlo worked. If so, is there any way to get the same results with each run, but at the same time to use MCCV?
Moreover, the goal is to also develop an ANN model (also with Monte Carlo), eventually compare MLR and ANN, and then make predictions for the future period using the better of the two models. I read somewhere that MCCV cannot be used when making predictions; is this right?
Many thanks for your time.
In order to apply MCCV you should run the process of randomly generating (without replacement) the training set and the test set multiple times.
So, roughly speaking, you need to insert your code (generation of training/test sets, learning and prediction) inside a for loop.
Note that the partitions are generated independently for each run, so the same data point can end up in the test set of several runs (or in none of them); this is, in fact, the significant difference from k-fold cross-validation.
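A rough sketch of such a loop (assuming X and Y are your full feature matrix and target; scikit-learn's ShuffleSplit implements the same repeated random splitting, and fixing the seed once makes the runs reproducible, which also addresses the earlier question about getting the same results each run):

import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)   # fix the seed once so the whole experiment is reproducible
rmses = []
for _ in range(100):              # number of Monte Carlo repetitions
    # new random train/test partition on each repetition
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.3, random_state=rng.randint(10000))
    regressor = linear_model.LinearRegression().fit(X_train, Y_train)
    y_predLR = regressor.predict(X_test)
    rmses.append(np.sqrt(mean_squared_error(Y_test, y_predLR)))

print("RMSE: %.3f +/- %.3f" % (np.mean(rmses), np.std(rmses)))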