Model evaluation and parameter tuning with CV - scikit-learn

I am trying to compare three models: SVM, RandomForest, and LogisticRegression.
I have an imbalanced dataset. First I split it into train and test sets with an 80% - 20% ratio, setting stratify=y.
Next, I used StratifiedKFold only on the train set. What I am trying to do now is fit the models and choose the best one. I also want to use grid search on each of the models to find the best parameters.
My code so far is the following:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)
for train_index, test_index in skf.split(X_train, y_train):
    X_train_folds, X_test_folds = X_train[train_index], X_train[test_index]
    y_train_folds, y_test_folds = y_train[train_index], y_train[test_index]
    X_train_2, X_test_2, y_train_2, y_test_2 = X[train_index], X[test_index], y[train_index], y[test_index]
How can I fit a model using all the folds? How can I grid search? Should I have a double loop? Can you help?

You can use scikit-learn's GridSearchCV.
You will find an example here of how to evaluate the performance of the various models and assess the statistical significance of the results.
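A minimal sketch of how that could look for the three models in the question (the parameter grids, the f1_macro scoring, and the dictionary layout are illustrative assumptions, not part of the original answer):

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)

# One (estimator, parameter grid) pair per candidate model; the grids are only examples.
candidates = {
    'svm': (SVC(), {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}),
    'rf': (RandomForestClassifier(random_state=0), {'n_estimators': [100, 300], 'max_depth': [None, 10]}),
    'logreg': (LogisticRegression(max_iter=1000), {'C': [0.01, 0.1, 1, 10]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # GridSearchCV loops over the stratified folds internally, so no manual double loop is needed.
    search = GridSearchCV(estimator, grid, cv=skf, scoring='f1_macro', n_jobs=-1)
    search.fit(X_train, y_train)
    results[name] = (search.best_score_, search.best_params_, search.best_estimator_)
    print(name, search.best_score_, search.best_params_)

Passing the StratifiedKFold object as cv keeps the class ratios in every fold, which matters for an imbalanced dataset; a scorer such as f1_macro (an assumption here) is usually more informative than plain accuracy in that setting. Keep X_test and y_test aside and evaluate only the finally chosen best_estimator_ on them.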

Related

How to save prediction results from an ML model (SVM, kNN) using sklearn

I have some labeled data and I am using classification ML models (SVM, kNN) to train and test on the dataset.
My input features look like:
(442, 443, 0.608923884514436), (444, 443, 0.6418604651162789)
The label looks like:
0, 1
Then I used sklearn to train and test (after splitting the dataset 80% for train and 20% for the test). Code sample is given below:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

classifiers = [
    SVC(),
    KNeighborsClassifier(n_neighbors=5)]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
trainingData = X_train
trainingScores = y_train

for item in classifiers:
    print(item)
    clf = item
    clf.fit(trainingData, trainingScores)
    y_pred = clf.predict(X_test)
    print("Accuracy Score:")
    print(accuracy_score(y_test, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
The SVC Accuracy Score: 0.6639580602883355
The kNN Accuracy Score: 0.7171690694626475
I can guess that the model is predicting some data correctly. My questions are:
How can I save the prediction data, including the label given by the model, in a CSV file?
Is it possible to use the cross-validation concept here? For example, if I want to apply 5-fold cross-validation, how can I do that?
import pandas as pd
labels_df = pd.DataFrame(y_pred, columns=["predicted_label"])
labels_df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv', index=False)
For the second question, check this:
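A minimal sketch of 5-fold cross-validation for this setup with cross_val_score (the accuracy scoring mirrors what the question already reports; reusing X and y from the question is an assumption):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

for clf in [SVC(), KNeighborsClassifier(n_neighbors=5)]:
    # cv=5 fits and scores the classifier on 5 different train/validation splits of X, y.
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print(clf, scores.mean(), scores.std())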

Random subsets of a dataset

I would like to compare the classification performance (accuracy) of different classifiers (e.g. CNN, SVM, ...), depending on the size of the training data set.
Given is a dataset of images (e.g. MNIST), from which 80% of the images are randomly selected while preserving the class balance. Subsequently, 80% of the images for the next smaller subset are to be selected from that subset in the same way, and so on. This is repeated until a small training set of about 1000 images is reached.
Each of the classifiers should then be trained with each of these subsets.
The aim is to be able to make a statement such as: from a training size of 5000 images onwards, classifier A is significantly better than classifier B.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size= 0.2, stratify=y)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train, random_state=0, test_size= 0.2, stratify=y_train)
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_train_2, y_train_2, random_state=0, test_size= 0.8, stratify=y_train_2)
.....
.....
.....
My problem is that I am not sure whether this is really random sampling when I use the above code. Would it be better to draw the subsets with, e.g., numpy.random.randint?
I would be very grateful for any help.
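One way to sketch the nested, class-balanced subsets described above is to call train_test_split repeatedly with stratify, keeping 80% each time (the 1000-image stopping point follows the question; the loop structure and variable names are assumptions):

from sklearn.model_selection import train_test_split

subsets = []
X_cur, y_cur = X_train, y_train  # start from the full stratified training split
while len(X_cur) > 1000:
    # Keep a stratified 80% of the current subset so the class balance is preserved.
    X_cur, _, y_cur, _ = train_test_split(
        X_cur, y_cur, train_size=0.8, stratify=y_cur, random_state=0)
    subsets.append((X_cur, y_cur))

Because train_test_split shuffles and, with stratify, samples at random within each class, there is no need to fall back to numpy.random.randint; fixing random_state only makes the runs reproducible.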

What does this code mean? (train_test_split, scikit-learn)

Everywhere I go I see this code, and I need help understanding it.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,testsize = 0.20)
What do X_train, X_test, y_train and y_test mean in this context, and which of them should I put in fit() and predict()?
As the documentation says, what train_test_split does is: Split arrays or matrices into random train and test subsets. You can find it here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. I believe the right keyword argument is test_size instead of testsize; it represents the proportion of the dataset to include in the test split if it is a float, or the absolute number of test samples if it is an int.
X and y are the sequences of indexables with the same length / shape[0], so basically the arrays/lists/matrices/dataframes to be split.
So, all in all, the code splits X and y into random train and test subsets (X_train and X_test for X, and y_train and y_test for y). Each test subset should contain 20% of the original array entries as test samples.
You should pass the _train subsets to fit() and the _test subsets to predict(). Hope this helps~
In simple terms, train_test_split divides your dataset into a training dataset and a validation dataset.
The validation set is used to evaluate a given model, so in this case the validation dataset gives us an idea of model performance.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
The above line splits the data into 4 parts:
X_train - training dataset
y_train - output (labels) of the training dataset
X_test - validation dataset
y_test - output (labels) of the validation dataset
and test_size=0.20 means you'll have 20% validation data and 80% training data.
Basically this code splits your data into two parts:
X_train, y_train are used for training
X_test, y_test are used for testing
With the test_size argument you can set the size of the testing data.
After dividing the data into two parts, you fit the training data to your model with the fit() method.
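A minimal sketch of the fit/predict pattern the answers above describe (the choice of LogisticRegression is an illustrative assumption; any scikit-learn classifier follows the same pattern):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)        # learn from the training split
y_pred = model.predict(X_test)     # predict on the held-out split
print(accuracy_score(y_test, y_pred))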

Cross-validation in sklearn: do I need to call fit() as well as cross_val_score()?

I would like to use k-fold cross validation while learning a model. So far I am doing it like this:
# splitting dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(dataset_1, df1['label'], test_size=0.25, random_state=4222)
# learning a model
model = MultinomialNB()
model.fit(X_train, y_train)
scores = cross_val_score(model, X_train, y_train, cv=5)
At this step I am not quite sure whether I should call model.fit() or not, because in the official scikit-learn documentation they do not fit but just call cross_val_score as follows (they do not even split the data into training and test sets):
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
I would like to tune the hyperparameters of the model while training it. What is the right pipeline?
If you want to do hyperparameter selection then look into RandomizedSearchCV or GridSearchCV. If you want to use the best model afterwards, then call either of these with refit=True and then use best_estimator_.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

log_params = {'penalty': ['l1', 'l2'], 'C': [1E-7, 1E-6, 1E-5, 1E-4, 1E-3]}
# liblinear supports both the 'l1' and 'l2' penalties
clf = LogisticRegression(solver='liblinear')
search = RandomizedSearchCV(clf, scoring='average_precision', cv=10,
                            n_iter=10, param_distributions=log_params,
                            refit=True, n_jobs=-1)
search.fit(X_train, y_train)
clf = search.best_estimator_
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Your second example is the right way to do the cross validation. See the example here: http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
The fitting is done inside the cross_val_score function; you don't need to worry about it beforehand.
[Edited] If, besides cross validation, you want to train a final model, you can call model.fit() afterwards.
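A minimal sketch of that flow for the question's setup (MultinomialNB and the existing train/test split are taken from the question; the printed summary is an assumption):

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

model = MultinomialNB()
# cross_val_score clones and fits the model internally on each of the 5 folds.
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean(), scores.std())

# Only if you also want a fitted model for predictions afterwards:
model.fit(X_train, y_train)
print(model.score(X_test, y_test))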

How to run only one fold of cross validation in sklearn?

I have the following code to run a 10-fold cross validation in scikit-learn:
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
scores = model_selection.cross_val_score(MyEstimator(), x_data, y_data, cv=cv, scoring='neg_mean_squared_error') * -1
For debugging purposes, while I am trying to make MyEstimator work, I would like to run only one fold of this cross-validation instead of all 10. Is there an easy way to keep this code but tell it to run just the first fold and then exit?
I would still like the data to be split into 10 parts, but only one of those 10 train/test combinations should be fitted and scored, instead of all 10.
No, not with cross_val_score, I suppose. You can set n_splits to a minimum value of 2, but that would still be a 50:50 train/test split, which you may not want.
If you want to maintain a 90:10 ratio and test other parts of the code, like MyEstimator(), then you can use a workaround.
You can use KFold.split() to get the first set of train and test indices and then break the loop after the first iteration.
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(x_data):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = x_data[train_index], x_data[test_index]
    y_train, y_test = y_data[train_index], y_data[test_index]
    break
Now use this X_train, y_train to train the estimator and X_test, y_test to score it.
Instead of:
scores = model_selection.cross_val_score(MyEstimator(),
                                         x_data, y_data,
                                         cv=cv,
                                         scoring='neg_mean_squared_error')
Your code becomes:
myEstimator_fitted = MyEstimator().fit(X_train, y_train)
y_pred = myEstimator_fitted.predict(X_test)
from sklearn.metrics import mean_squared_error
# Appending to a scores list, because a collection of scores is what cross_val_score returns.
scores = []
scores.append(mean_squared_error(y_test, y_pred))
Rest assured, cross_val_score does essentially this internally, plus some enhancements such as parallel processing.
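As an alternative sketch (assuming a reasonably recent scikit-learn), the cv argument of cross_val_score also accepts an iterable of (train, test) index pairs, so you can hand it just the first split and keep the original call:

cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
first_split = [next(cv.split(x_data))]  # a list holding only the first (train, test) index pair
scores = model_selection.cross_val_score(MyEstimator(), x_data, y_data,
                                         cv=first_split,
                                         scoring='neg_mean_squared_error') * -1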
