Does K-Fold iteratively train a model - scikit-learn

If you run cross_val_score() or cross_validate() on a dataset, is the estimator trained using all the folds at the end of the run?
I read somewhere that cross_val_score takes a copy of the estimator, whereas I thought this was how you train a model using k-fold.
Or, at the end of cross_validate() or cross_val_score(), do you have a single estimator that you then use for predict()?
Is my thinking correct?

You can refer to the scikit-learn documentation here.
If you do 3-fold cross-validation,
scikit-learn will split your dataset into 3 parts (for example, the 1st part contains the 1st-3rd rows, the 2nd part the 4th-6th rows, and so on).
It then trains a new model 3 times, each time with a different training set and validation set:
In the first round, it combines the 1st and 2nd parts as the training set and tests the model on the 3rd part.
In the second round, it combines the 1st and 3rd parts as the training set and tests the model on the 2nd part.
and so on.
So after cross_validate you end up with three models. If you want the model object from each round, add the parameter return_estimator=True. The returned dictionary will then have an extra key named estimator containing a list with the fitted estimator from each training round.
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
cv_results = cross_validate(lasso, X, y, cv=3, return_estimator=True)
print(sorted(cv_results.keys()))
#Output: ['estimator', 'fit_time', 'score_time', 'test_score']
cv_results['estimator']
#Output: [Lasso(), Lasso(), Lasso()]
However, in practice, cross-validation is used only for evaluating the model. Once you have found a good model and parameter setting that gives a high cross-validation score, it is better to refit the model on the whole training set and then test it on a held-out test set.
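A minimal sketch of that final step, continuing the Lasso example above (the train/test split here is made up purely for illustration):
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate, train_test_split
diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=0)
lasso = linear_model.Lasso()
# Use cross-validation on the training set to judge the model/parameters ...
cv_results = cross_validate(lasso, X_train, y_train, cv=3)
print(cv_results['test_score'])
# ... then refit a single estimator on the whole training set
lasso.fit(X_train, y_train)
# and evaluate it once on the held-out test set
print(lasso.score(X_test, y_test))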

Related

Is it possible to set the splitting strategy for GridSearchCV?

I'm optimizing my model's hyperparameters with GridSearchCV. Because the data I'm working with is very imbalanced, I need to "choose" how the algorithm splits the train/test sets in order to ensure that the underrepresented points are in both sets.
From reading scikit-learn's documentation, I get the idea that it's possible to set the splitting strategy for GridSearchCV, but I'm not sure how, or whether this is really the case.
I would be very grateful if someone could help me with this.
Yes, pass a StratifiedKFold object to GridSearchCV as its cv argument.
from sklearn.model_selection import StratifiedKFold
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
skf = StratifiedKFold(n_splits=5)
clf = GridSearchCV(svc, parameters, cv=skf)  # the splitter goes in via the cv argument
clf.fit(iris.data, iris.target)
By default, if you are tuning a classifier with GridSearchCV, the splitting strategy is already StratifiedKFold, which preserves the class proportions of the target variable in each fold.
If your dataset is imbalanced for some other reason (not the target variable), you can choose another criterion to perform the split. Read the documentation of GridSearchCV carefully and select an appropriate CV splitter.
The scikit-learn model selection documentation lists many splitter classes that you could use, or you can define your own splitting logic according to your criteria, although that is more work.
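As an illustration of that last option: GridSearchCV also accepts, as cv, any iterable of (train_indices, test_indices) pairs, so you can hand-craft the splits yourself. A minimal sketch (the splitting rule here is invented purely for illustration):
import numpy as np
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
n_samples = iris.data.shape[0]
# Hand-crafted splits: here every 3rd sample goes to the test side;
# in a real use case you would encode whatever criterion your data requires
custom_splits = []
for offset in range(3):
    test_idx = np.arange(offset, n_samples, 3)
    train_idx = np.setdiff1d(np.arange(n_samples), test_idx)
    custom_splits.append((train_idx, test_idx))
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters, cv=custom_splits)
clf.fit(iris.data, iris.target)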

What do the data analytics data set train and test variables represent?

Within the below code there are a few variables I'm confused about:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm,metrics,datasets
train_data=np.zeros((280,10304))
train_target=np.zeros((280))
test_data=np.zeros((120,10304))
test_target=np.zeros((120))
Can someone please explain what test_data, train_data, test_target and train_target represent and their purpose?
That's a rather unusual way of naming what is commonly called:
- X_train (here train_data): the inputs used to train your model
- Y_train (here train_target): the labels of the training rows, i.e. what your model learns to predict
- X_test (here test_data): the inputs used to test your model
- Y_test (here test_target): what you want your model to predict when testing it
"Testing" a model mostly means computing some metrics (accuracy, recall, ...) to determine how satisfied you are with your model once it's trained.
Note: all input rows must have the same length, and you must have the same number of rows in the inputs and in the labels, both when training and when testing.
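A minimal sketch of how these four arrays are typically used, with random stand-in data of the same shapes (real code would load actual samples and labels here):
import numpy as np
from sklearn import svm, metrics
rng = np.random.default_rng(0)
train_data = rng.random((280, 10304))
train_target = rng.integers(0, 2, 280)
test_data = rng.random((120, 10304))
test_target = rng.integers(0, 2, 120)
clf = svm.SVC()
clf.fit(train_data, train_target)       # learn from the training inputs and labels
predicted = clf.predict(test_data)      # predict labels for the unseen test inputs
print(metrics.accuracy_score(test_target, predicted))  # compare with the test labels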

Data Underfitting Or Not?

Is the regression line underfitting, and if so, what can I do to get accurate results? I have not been able to tell whether the regression line is overfitting, underfitting, or accurate, so suggestions on that would also be appreciated. The file "Advertising.csv": https://github.com/marcopeix/ISL-linear-regression/tree/master/data
#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error
#reading and knowing the data
data=pd.read_csv('Advertising.csv')
#print(data.head())
#print(data.columns)
#print(data.shape)
#plotting the data
plt.figure(figsize=(10,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
#storing data into variable and shaping data
X=data['TV'].values.reshape(-1,1)
Y=data['sales'].values.reshape(-1,1)
#calling the model and fitting the model
reg=LinearRegression()
reg.fit(X,Y)
#making predictions
predictions=reg.predict(X)
#plotting the predicted data
plt.figure(figsize=(16,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.plot(data['TV'],predictions, c='blue',linewidth=2)
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
r2= r2_score(Y,predictions)
print("R2 score is: ",r2)
print("Accuracy: {:.2f}".format(reg.score(X,Y)))
To work out if your model is underfitting (or overfitting) you need to look at the bias of the model (the distance between the output predicted by your model and the expected output). You can't (to the best of my knowledge) do it just by looking at your code, you need to evaluate your model as well (run it).
As it's a linear regression it's likely that you're underfitting.
I'd suggest splitting your data into a training set and a testing set. You can fit your model on the training set, and see how well it performs on unseen data using the testing set. A model is underfitting if it performs miserably on both the training data as well as the testing data. It's overfitting if it performs brilliantly on the training data but less well on the testing data.
Try something along the lines of:
from sklearn.model_selection import train_test_split
# This will split the data into a train set and a test set, leaving 20% (the test_size parameter) for testing
X, X_test, Y, Y_test = train_test_split(data['TV'].values.reshape(-1,1), data['sales'].values.reshape(-1,1), test_size=0.2)
# Then fit your model ...
# e.g. reg.fit(X,Y)
# Finally evaluate how well it does on the training and test data.
print("Test score " + str(reg.score(X_test, Y_test)))
print("Train score " + str(reg.score(X_test, Y_test)))
Instead of training and testing on the same data,
split your dataset into 2 or 3 sets (train, validation, test).
You may only need to split it into 2 (train, test); use the train_test_split function from the sklearn library.
Train your model on the training data, then test it on the testing data and see whether you get a good result.
If the model's training accuracy is very high but its testing accuracy is very low, you can say it has overfit. If the model doesn't even reach a high accuracy on the training data, it is underfitting.
Hope it helps. :)
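For reference, a minimal sketch of the three-way split described above, reusing X and Y from the question's code (the 60/20/20 proportions are just an example):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# First carve off 20% as the final test set ...
X_rest, X_test, y_rest, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
# ... then split the remainder into train and validation sets (75/25 here, i.e. 60/20 overall)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
reg = LinearRegression()
reg.fit(X_train, y_train)
print("Train R2:", reg.score(X_train, y_train))
print("Validation R2:", reg.score(X_val, y_val))
print("Test R2:", reg.score(X_test, y_test))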

Scikit-Learn: Avoiding Data Leakage During Cross-Validation

I've just been reading up on k-fold cross-validation and have realized that I'm inadvertently leaking data with my current preprocessing setup.
Usually, I have a train and test dataset. I do a bunch of data imputation and one-hot encoding on my entire train dataset and then run k-fold cross-validation.
The leakage comes in because, if I'm doing 5-fold cross-validation, I'm training on 80% of my train data and testing it on the remaining 20% of the train data.
I really should just be imputing the 20% based on the 80% of train (whereas I was using 100% of the data before).
1) Is this the right way to think about cross-validation?
2) I've been looking at the Pipeline class in sklearn.pipeline and it seems useful for doing a bunch of transformations and then finally fitting a model to the resulting data. However, I'm doing a bunch of stuff like "impute missing data in float64 columns with the mean", "impute all other data with the mode", etc.
There isn't an obvious transformer for this kind of imputation. How would I go about adding this step to a Pipeline? Would I just make my own subclass of BaseEstimator?
Any guidance here would be great!
1) Yes, you should impute the 20% test data using the 80% training data.
2) I wrote a blog post that answers your second question, but I'll include the core parts here.
With sklearn.pipeline, you can apply separate preprocessing rules to different feature types (e.g., numeric, categorical). In the example code below, I impute the median of numeric features before scaling them. The categorical and boolean features are imputed with the mode -- the categorical features are one-hot encoded.
You can include an estimator at the end of the pipeline for regression, classification, etc.
import numpy as np
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.impute import SimpleImputer  # Imputer has been replaced by SimpleImputer in current scikit-learn
from sklearn.preprocessing import OneHotEncoder, StandardScaler
preprocess_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        ("numeric_features", make_pipeline(
            TypeSelector(np.number),
            SimpleImputer(strategy="median"),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder()
        )),
        ("boolean_features", make_pipeline(
            TypeSelector("bool"),
            SimpleImputer(strategy="most_frequent")
        ))
    ])
)
The TypeSelector portion of the pipeline assumes the object X is a pandas DataFrame. The subset of columns with the given data type is selected with TypeSelector.transform.
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
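As noted above, you can then append an estimator to the end of this preprocessing pipeline; a short sketch (logistic regression is just a placeholder, use whatever model you need):
from sklearn.linear_model import LogisticRegression
# Preprocessing runs first, then the estimator is fit on the transformed features
full_pipeline = make_pipeline(preprocess_pipeline, LogisticRegression())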
I recommend thinking of 5-fold cross validation as simply splitting the data into 5 parts (or folds). You hold out one fold for testing and use the other 4 together as your training set. You repeat this process another 4 times until each fold has had the chance to be tested.
For your imputation to work correctly and not be subject to contamination, you need to determine the mean from the 4 folds used for training, and use it to impute that value in both the training folds and the test fold.
I like to implement the CV split with StratifiedKFold. This ensures that each fold preserves the class proportions of the target.
To answer your question about using Pipelines, I would say you should probably subclass BaseEstimator with your custom imputation transformer. Inside your loop over the CV splits, compute the mean from your training set and set it as a parameter of your transformer; then you can call fit or transform.
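A minimal sketch of that idea, letting a Pipeline do the bookkeeping instead of a manual loop (SimpleImputer stands in for the custom transformer, and the toy data is invented for illustration):
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
# Toy data with some missing values
rng = np.random.default_rng(0)
X = rng.random((100, 5))
X[rng.random((100, 5)) < 0.1] = np.nan
y = rng.integers(0, 2, 100)
# Because the imputer lives inside the pipeline, its mean is re-computed
# from the 4 training folds on every CV split, never from the held-out fold
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))
print(scores)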

Using a transformer (estimator) to transform the target labels in sklearn.pipeline

I understand that one can chain several estimators that implement the transform method to transform X (the feature set) in sklearn.pipeline. However, I have a use case where I would also like to transform the target labels (e.g. map the labels to [1...K] instead of [0, K-1]), and I would love to do that as a component of my pipeline. Is it possible to do that at all with sklearn.pipeline?
There is now a nicer way to do this built into scikit-learn; using a compose.TransformedTargetRegressor.
When constructing these objects you give them a regressor and a transformer. When you .fit() them they transform the targets before regressing, and when you .predict() them they transform their predicted targets back to the original space.
It's important to note that you can pass them a pipeline object, so they should interface nicely with your existing setup. For example, take the following setup where I train a ridge regression to predict 1 target given 2 features:
# Imports
import numpy as np
from sklearn import compose, linear_model, metrics, pipeline, preprocessing
# Generate some training and test features and targets
X_train = np.random.rand(200).reshape(100,2)
y_train = 1.2*X_train[:, 0]+3.4*X_train[:, 1]+5.6
X_test = np.random.rand(20).reshape(10,2)
y_test = 1.2*X_test[:, 0]+3.4*X_test[:, 1]+5.6
# Define my model and scalers
ridge = linear_model.Ridge(alpha=1e-2)
scaler = preprocessing.StandardScaler()
minmax = preprocessing.MinMaxScaler(feature_range=(-1,1))
# Construct a pipeline using these methods
pipe = pipeline.make_pipeline(scaler, ridge)
# Construct a TransformedTargetRegressor using this pipeline
# ** So far the set-up has been standard **
regr = compose.TransformedTargetRegressor(regressor=pipe, transformer=minmax)
# Fit and train the regr like you would a pipeline
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print("MAE: {}".format(metrics.mean_absolute_error(y_test, y_pred)))
This still isn't quite as smooth as I'd like it to be; for example, you can access the regressor contained by a TransformedTargetRegressor using .regressor_, but the coefficients stored there are untransformed. This means there are some extra hoops to jump through if you want to work your way back to the equation that generated the data.
No, pipelines will always pass y through unchanged. Do the transformation outside the pipeline.
(This is a known design flaw in scikit-learn, but it's never been pressing enough to change or extend the API.)
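For classification labels like the [1...K] case in the question, doing the transformation outside the pipeline can be as simple as this sketch (assuming a plain integer shift is all that's needed; the data is invented for illustration):
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
X = np.random.rand(100, 4)
y = np.random.randint(0, 3, 100)   # labels in [0, K-1]
# Shift the labels to [1, K] outside the pipeline ...
y_shifted = y + 1
# ... then fit the pipeline on the transformed labels as usual
clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X, y_shifted)
print(clf.predict(X[:5]))  # predictions come back in the [1, K] label space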
You could add the label column to the end of the training data, apply your transformation, and then delete that column before training your model. It's not very elegant, but it works.