How to get best_estimator parameters from GridSearch using cross_val_score? - scikit-learn

I want to know the result of the GridSearch when I'm using nested cross validation with cross_val_score for convenience.
When using cross_val_score, you get an array of scores. It would be useful to receive the fitted estimator back or a summary of the chosen parameters for that estimator.
I know I could do this myself by implementing the outer cross-validation loop manually, but it would be much more convenient if it could be done in conjunction with cross_val_score.
Any way to do it or is this a feature to suggest?

The GridSearchCV class in scikit-learn already does cross validation internally. You can pass any CV iterator as the cv argument of the constructor of GridSearchCV.
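For example, a quick sketch of passing an explicit CV iterator (shown with the current sklearn.model_selection imports; older releases used sklearn.grid_search and sklearn.cross_validation instead):
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # any CV iterator/splitter works here
grid = GridSearchCV(LinearRegression(), {'fit_intercept': [True, False]}, cv=inner_cv)
grid.fit(X, y)
print(grid.best_params_)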

The answer to your question is that it is a feature to suggest. Unfortunately, you can't get the best parameters of the models fitted with nested cross-validation using cross_val_score (as of now, scikit-learn 0.14).
See this example:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score
digits = datasets.load_digits()
X = digits.data
y = digits.target
hyperparams = [{'fit_intercept':[True, False]}]
algo = LinearRegression()
grid = GridSearchCV(algo, hyperparams, cv=5, scoring='mean_squared_error')
# Nested cross validation
cross_val_score(grid, X, y)
grid.best_score_
[Out]:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-4c4ac83c58fb> in <module>()
15 # Nested cross validation
16 cross_val_score(grid, X, y)
---> 17 grid.best_score_
AttributeError: 'GridSearchCV' object has no attribute 'best_score_'
(Note also that the scores you get from cross_val_score are not the ones defined by scoring, here the mean squared error; what you see instead is the output of the best estimator's own score method. The corresponding bug in v0.14 is described here.)
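A partial workaround for the metric mismatch (my own suggestion, not from the original answer): pass the same scoring to the outer cross_val_score so both levels use mean squared error. The best parameters of each outer fold still remain out of reach, because cross_val_score fits clones of grid rather than grid itself.
# Outer loop now scored with the same metric as the inner grid search
outer_scores = cross_val_score(grid, X, y, scoring='mean_squared_error')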

In sklearn v0.20.0 (which will be released in late 2018), the trained estimators are exposed by the function cross_validate if requested.
See the corresponding pull request for the new feature here. Something like this will work:
from sklearn.metrics.scorer import check_scoring
from sklearn.model_selection import cross_validate
scorer = check_scoring(estimator=gridSearch, scoring=scoring)
cvRet = cross_validate(estimator=gridSearch, X=X, y=y,
                       scoring={'score': scorer}, cv=cvOuter,
                       return_train_score=False,
                       return_estimator=True,
                       n_jobs=nJobs)
scores = cvRet['test_score']    # Equivalent to output of cross_val_score()
estimators = cvRet['estimator']
If return_estimator=True, the estimators can be retrieved from the returned dictionary as cvRet['estimator']. The list stored in cvRet['test_score'] is equivalent to the output of cross_val_score. See here how cross_val_score() is implemented by means of cross_validate().
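Once the fitted estimators are available, the original question can be answered; a small sketch, assuming each entry of cvRet['estimator'] is a fitted GridSearchCV object so that its standard attributes are accessible:
# Inspect the inner-CV winner of every outer fold
for fold, est in enumerate(cvRet['estimator']):
    print(fold, est.best_params_, est.best_score_)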

Related

Outlier elimination in an imblearn pipeline affecting both X and y

I aim to integrate outlier elimination into a machine learning pipeline with a continuous dependent variable. The challenge is keeping X and y at the same length, so I have to eliminate outliers from both.
As this task turned out to be difficult or impossible using sklearn, I switched to imblearn and FunctionSampler. Inspired by the documentation, I tried the following code:
from imblearn import FunctionSampler
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
def outlier_rejection(X, y):
    # rng is assumed to be defined elsewhere, e.g. a numpy RandomState
    model = IsolationForest(max_samples=100, contamination=0.4, random_state=rng)
    model.fit(X)
    y_pred = model.predict(X)
    return X[y_pred == 1], y[y_pred == 1]
pipe = make_pipeline(
    FunctionSampler(func=outlier_rejection),
    LinearRegression()
)
pipe.fit(X_train, y_train)  # y_train is a continuous variable!
However, when I tried to fit the pipeline I got the error
ValueError: Unknown label type: 'continuous'
which I think is because my dependent variable is continuous.
I suspect that imblearn is only compatible with nominal data. Is that true? If yes, is there another approach to solve my problem (e.g. with classic sklearn pipeline)? If not, where did I make a mistake in the code above?
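One thing that may be worth checking (this is a guess, not a confirmed fix): FunctionSampler validates X and y by default, and that validation treats y as a classification target, which is the usual source of "Unknown label type: 'continuous'". A minimal sketch, assuming your imblearn version supports the validate argument and that skipping validation is acceptable for your data:
# Skip imblearn's target-type validation so a continuous y is passed through to func
pipe = make_pipeline(
    FunctionSampler(func=outlier_rejection, validate=False),
    LinearRegression()
)
pipe.fit(X_train, y_train)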

scikit-learn linear regression K fold cross validation

I want to run linear regression with K-fold cross-validation using the sklearn library on my training data to obtain the best regression model. I then plan to use the predictor with the lowest mean error on my test set.
For example, the piece of code below gives me an array of 20 results with different negative mean absolute errors. I am interested in finding the predictor that gives me the least error and then using that predictor on my test set.
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
There is no such thing as a "predictor which gives me this (least) error" in cross_val_score; all estimators in:
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
are the same: they are all LinearRegression() instances with identical hyperparameters, just fitted on different folds.
You may wish to check GridSearchCV, which will indeed search through different sets of hyperparameters and return the best estimator:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
X,y = datasets.make_regression()
lr_model = LinearRegression()
parameters = {'normalize':[True,False]}
clf = GridSearchCV(lr_model, parameters, refit=True, cv=5)
best_model = clf.fit(X,y)
Note the refit=True param that ensures the best model is refit on the whole dataset and returned.
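Once fitted this way, the winning configuration and the refit model are available through the standard attributes; a short sketch (shown on X for brevity; in practice you would call predict on your own held-out test set):
print(best_model.best_params_)                        # hyperparameters of the winning model
predictions = best_model.best_estimator_.predict(X)   # equivalently: best_model.predict(X)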

Sklearn Logistic Regression predict_proba returning 0 or 1

I don't have any example data to share in order to replicate the problem, but perhaps someone can provide a high-level answer. I've created a lot of logistic regression models in the past, and this is the first time my predict_proba scores are showing up as either 1 or 0.
I'm creating a binary classifier to predict one of two labels. I've also used a couple of other algorithms, XGBClassifier and RandomForestClassifier, with the same dataset. For these, predict_proba yields the expected probability results (i.e., float values between 0 and 1).
Also, for the LogisticRegression model, I've tried a variety of parameters including all default params, yet the issue persists. Weirdly enough, using SGDClassifier with loss = 'log' or 'modified_huber' also yields the same binary predict_proba results, so I'm thinking this might be something intrinsic to the dataset, but not sure. Also, this issue only occurs if I standardize training set data. So far I've tried both StandardScaler and MinMaxScaler, same results.
Has anyone ever encountered a problem such as this?
Edit:
The LR parameters are:
LogisticRegression(C=1.7993269963183343, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=.5,
                   max_iter=100, multi_class='warn', n_jobs=-1, penalty='elasticnet',
                   random_state=58, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
Again, the issue only occurs when standardizing the data with either StandardScaler() or MinMaxScaler(). Which is odd, because the data is not on a uniform scale across all features: some features are represented as percentages, others as dollar values, and others as dummy-coded representations.
This can happen when you do the following two things in sequence:
Fit an estimator with standardized training data and then later on,
Pass unstandardized data to the same estimator in the validation or testing phase.
Here's an example of predict_proba returning 0 or 1 using the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=123)
# Example 1 [CORRECT]
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
# Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
print(pipeline)
y_pred = pipeline.predict_proba(X_test)
# [0.37264656 0.62735344]
print(y_pred.mean(axis=0))
# Example 2 [INCORRECT]
# Fit the model with standardized training set
X_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_scaled, y_train)
# Test the model with unstandardized test set
y_pred = model.predict_proba(X_test)
# [1.00000000e+000 2.48303123e-204]
print(y_pred.mean(axis=0))
Since the estimator in Example 2 was fitted on scaled data with a unit variance of 1.0 (X_scaled), the variance of the data it's being tested on (X_test) is much higher than expected. It's no surprise then that this results in very extreme probabilities.
You can prevent this from happening by wrapping your estimator within a pipeline and calling the pipeline fit method instead of the estimator's fit method (see Example 1). Doing it this way guarantees that the same transformations are applied to the data in the training, validation and testing phases.
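If a pipeline is not an option, the equivalent manual fix is to keep the fitted scaler and apply the same transformation to the test set; a minimal sketch reusing the names from Example 2:
# Fix for Example 2: reuse the scaler that was fit on the training data
scaler = StandardScaler().fit(X_train)
model = LogisticRegression()
model.fit(scaler.transform(X_train), y_train)
y_pred = model.predict_proba(scaler.transform(X_test))  # probabilities are no longer pushed to 0/1
print(y_pred.mean(axis=0))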

sample_weight parameter shape error in scikit-learn GridSearchCV

Passing the sample_weight parameter to GridSearchCV raises an error due to an incorrect shape. My suspicion is that cross-validation is not capable of splitting sample_weight in accordance with the dataset split.
First part: Using sample_weight as a model parameter works beautifully
Let's consider a simple example, first without GridSearch:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
dataURL = 'https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sinusoidal_data.csv'
x = pd.read_csv(dataURL, usecols=["x"]).x
y = pd.read_csv(dataURL, usecols=["y"]).y
occurrences = pd.read_csv(dataURL, usecols=["Occurrences"]).Occurrences
my_sample_weights = (1 - occurrences/10000)**3
my_sample_weights contains the importance that I assign to each observation in x, y, as the following picture shows. The points of the sinusoidal curve get higher weights than those forming the background noise.
plt.scatter(x, y, c=my_sample_weights>0.9, cmap="cool")
Let's train a neural network, first without using the information contained in my_sample_weights:
def make_model(number_of_hidden_neurons=1):
    model = Sequential()
    model.add(Dense(number_of_hidden_neurons, input_shape=(1,), activation='tanh'))
    model.add(Dense(1, activation='linear'))
    model.compile(optimizer='sgd', loss='mse')
    return model
net_Not_using_sample_weight = make_model(number_of_hidden_neurons=6)
net_Not_using_sample_weight.fit(x, y, epochs=1000)
plt.scatter(x, y)
plt.scatter(x, net_Not_using_sample_weight.predict(x), c="green")
As the following picture shows, the neural network tries to fit the shape of the sinusoid, but the background noise prevents a good fit.
Now, using the information in my_sample_weights, the quality of the prediction is much better.
Second part: Using sample_weight as a GridSearchCV parameter raises an error
my_Regressor = KerasRegressor(make_model)
validator = GridSearchCV(my_Regressor,
                         param_grid={'number_of_hidden_neurons': range(4, 5),
                                     'epochs': [500],
                                     },
                         fit_params={'sample_weight': [my_sample_weights]},
                         n_jobs=1,
                         )
validator.fit(x, y)
Trying to pass the sample_weights as a parameter gives the following error:
...
ValueError: Found a sample_weight array with shape (1000,) for an input with shape (666, 1). sample_weight cannot be broadcast.
It seems that the sample_weight vector has not been split in a similar manner to the input array.
For what it's worth:
import sklearn
print(sklearn.__version__)
0.18.1
import keras
print(keras.__version__)
2.0.5
The problem is that, by default, GridSearchCV uses 3-fold cross-validation unless explicitly stated otherwise. This means that 2/3 of the data points are used as training data and 1/3 for validation, which fits the error message: the fit_params array of length 1000 doesn't match the number of examples used for training (666). Adjust the size and the code will run:
my_sample_weights = np.random.uniform(size=666)
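Note that replacing the real weights with a random vector of length 666 only fixes the shape; it discards the weighting you actually care about. A hedged alternative for much newer scikit-learn releases than the 0.18.1 above (whether fit parameters of the same length as X are sliced per training fold depends on the version, so verify against yours): pass the weights to fit instead of fit_params in the constructor.
# Sketch, assuming a scikit-learn version that slices per-sample fit params per CV fold
validator = GridSearchCV(my_Regressor,
                         param_grid={'number_of_hidden_neurons': range(4, 5),
                                     'epochs': [500]},
                         n_jobs=1)
validator.fit(x, y, sample_weight=my_sample_weights)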
We developed PipeGraph, an extension to the scikit-learn Pipeline that allows you to get intermediate data, build graph-like workflows, and, in particular, solve this problem (see the examples in the gallery at http://mcasl.github.io/PipeGraph ).

sklearn auc ValueError: Only one class present in y_true

I searched Google and saw a couple of StackOverflow posts about this error, but they don't cover my case.
I use keras to train a simple neural network and make some predictions on the split test dataset. But when I use roc_auc_score to calculate the AUC, I get the following error:
"ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.".
I inspected the target label distribution, and it is highly imbalanced. Some labels (out of the 29 labels in total) have only 1 instance, so it's likely they have no positive instance in the test set. That's why sklearn's roc_auc_score function reported the only-one-class problem, which is reasonable.
But I'm curious, as when I use sklearn's cross_val_score function, it can handle the AUC calculation without error.
my_metric = 'roc_auc'
scores = cross_validation.cross_val_score(myestimator, data,
                                          labels, cv=5, scoring=my_metric)
I wonder what happened in the cross_val_score, is it because the cross_val_score use a stratified cross-validation data split?
UPDATE
I continued digging, but still can't find the difference. I see that cross_val_score calls check_scoring(estimator, scoring=None, allow_none=False) to return a scorer, and check_scoring calls get_scorer(scoring), which returns scorer=SCORERS[scoring].
And the SCORERS['roc_auc'] is roc_auc_scorer;
the roc_auc_scorer is made by
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)
So it's still using the roc_auc_score function. I don't get why cross_val_score behaves differently from calling roc_auc_score directly.
I think your hunch is correct. The AUC (area under the ROC curve) needs a sufficient number of samples of each class in order to make sense.
By default, cross_val_score calculates the performance metric on each fold separately. Another option could be to use cross_val_predict and compute the AUC over all folds combined.
You could do something like:
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
class ProbaEstimator(LogisticRegression):
    """
    This little hack is needed because `cross_val_predict`
    uses `estimator.predict(X)` internally.
    Replace `LogisticRegression` with whatever classifier you like.
    """
    def predict(self, X):
        return super(self.__class__, self).predict_proba(X)[:, 1]
# some example data
X, y = make_classification()
# define your estimator
estimator = ProbaEstimator()
# get predictions
pred = cross_val_predict(estimator, X, y, cv=5)
# compute AUC score
roc_auc_score(y, pred)
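As a side note, newer scikit-learn versions make the subclassing hack unnecessary: sklearn.model_selection.cross_val_predict accepts a method argument, so probabilities can be requested directly. A short sketch of the same idea with modern imports:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
X, y = make_classification()
# request class probabilities directly instead of subclassing the estimator
proba = cross_val_predict(LogisticRegression(), X, y, cv=5, method='predict_proba')
roc_auc_score(y, proba[:, 1])   # AUC computed over all folds combined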
