Error message on attempting to fit training data using GridSearch function - python-3.x

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures
polyreg = PolynomialFeatures(degree = 4)
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search_polyreg = GridSearchCV(polyreg, param_grid, cv = 5)
grid_search_polyreg.fit(x_train, y_train)
grid_search_polyreg.score(x_test, y_test)
print("Best Parameters for polynomial regression:
{}".format(grid_search_polyreg.best_params_))
print("Best Score for polynomial regression:
{:.2f}".format(grid_search_polyreg.best_score_))
TypeError: If no scoring is specified, the estimator passed should
have a 'score' method. The estimator PolynomialFeatures(degree=4,
include_bias=True, interaction_only=False) does not.
1) I understand that alpha is not a parameter of PolynomialFeatures, but when I tried removing alpha and fitting the data, it still did not work.
2) Does that mean I am not supposed to use grid search to get scores for KNN regression, linear and kernel SVM?
I am new to python and any suggestion is much appreciated. Thanks in advance.

sklearn.preprocessing.PolynomialFeatures() doesn't have a scoring function. It's not actually an estimator or machine learning model, it just transforms a matrix. You can have it as part of your pipeline and test its parameters, but you have to pass an actual estimator with a scoring function to GridSearchCV.
Fitting to data has a different meaning when you're dealing with transformers vs. estimators; only in the latter case does it mean "train".
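For illustration, here is a minimal sketch of what that looks like, assuming the x_train/y_train data from the question and picking Ridge as the estimator (the alpha grid suggests a regularized linear model was intended):
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# PolynomialFeatures only transforms X; Ridge is the estimator that supplies .score()
pipe = Pipeline([('poly', PolynomialFeatures()),
                 ('ridge', Ridge())])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {'poly__degree': [2, 3, 4],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

grid_search_polyreg = GridSearchCV(pipe, param_grid, cv=5)
grid_search_polyreg.fit(x_train, y_train)   # x_train, y_train as in the question
print("Best Parameters for polynomial regression: {}".format(grid_search_polyreg.best_params_))
print("Best Score for polynomial regression: {:.2f}".format(grid_search_polyreg.best_score_))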

Related

oversampling (SMOTE) does not work properly when fitted inside a pipeline

I have an imbalanced classification problem and I am using make_pipeline from imblearn
So the steps are the following:
kf = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)
params = {
    'max_depth': [2, 3, 5],
    # 'max_features': ['auto', 'sqrt', 'log2'],
    # 'min_samples_leaf': [5, 10, 20, 50, 100, 200, 300],
    'n_estimators': [10, 25, 30, 50]
    # 'bootstrap': [True, False]
}
from imblearn.pipeline import make_pipeline
imba_pipeline = make_pipeline(SMOTE(random_state = 42), RobustScaler(), RandomForestClassifier(random_state=42))
imba_pipeline
out: Pipeline(steps=[('smote', SMOTE(random_state=42)),
                     ('robustscaler', RobustScaler()),
                     ('randomforestclassifier',
                      RandomForestClassifier(random_state=42))])
new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
                         return_train_score=True, n_jobs=-1, verbose=2)
grid_imba.fit(X_train, y_train)
And everything is going OK and I reach the end of my problem (i.e. I can see the classification report).
However, when I try to see inside the black box with eli5 via eli5.explain_weights(imba_pipeline),
I get back the error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE(random_state=42)' (type <class 'imblearn.over_sampling._smote.SMOTE'>) doesn't
I know that this is a common problem and I have read the related questions, but I am confused because the problem occurs after the end of my classification procedure.
Any suggestions?
Your pipeline has two fitted steps (plus the scaler): the SMOTE augmentation and the random forest. It looks like this confuses eli5, which assumes that only the last step of the pipeline is the fitted estimator. To get the weight explanation of the random forest you could try calling eli5 on that step of the pipeline only with
from eli5 import explain_weights
explain_weights(imba_pipeline['randomforestclassifier'])
provided the pipeline is fitted; but in your code you fitted the grid search, so
explain_weights(grid_imba.best_estimator_['randomforestclassifier'])
would be more appropriate.
Just wanted to point out that SMOTE generally doesn't improve prediction quality. See https://arxiv.org/abs/2201.08528

What is the meaning of scikit-learn GridSearchCV best_score_ when GridSearchCV is used with KerasRegressor?

I use scikit-learn's GridSearchCV to grid-search hyperparameters for my Keras neural network (for a regression problem). The output of my neural network is a real value:
#generate a model (createModel is a function which returns a keras.Sequential model)
model = keras.wrappers.scikit_learn.KerasRegressor(build_fn=createModel)
#run the GridSearch
paramGrid = dict( epochs=[100, 250, 500], batch_size=[16, 32, 64] )
grid = sklearn.model_selection.GridSearchCV(estimator=model, param_grid=paramGrid, n_jobs=1, cv=5)
#obtain and print the result (X, y are some data)
grid_result = grid.fit(X, y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
I don't understand what exactly the best_score_ member in the grid search result is. Is it a gap between the theoretical values and the predicted values? This best_score_ is always negative (and quite large) in my examples; it doesn't make any sense to me.
When you don't pass a specific scoring metric, GridSearchCV uses the default score method of the estimator.
In your example you did not pass a metric to your grid search instance, so it uses the default score of KerasRegressor, which (according to the source code on GitHub) is the negated mean loss of the predictions; scikit-learn's convention is "greater is better", which is why the value you see is negative. Since you set cv=5, grid_result.best_score_ is that score averaged over the 5 folds for the best parameter combination.
I suggest you set your own performance metric by passing a value for scoring. Since this is a regression problem, pick a regression metric, for example:
grid = sklearn.model_selection.GridSearchCV(estimator=model, param_grid=paramGrid,
                                            scoring='neg_mean_squared_error', n_jobs=1, cv=5)
You can find a list of all the supported metrics here. You can also define your own.
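As a sketch of the "define your own" route (assuming the model and paramGrid objects from the question), an error metric can be wrapped with make_scorer:
from sklearn.metrics import make_scorer, mean_absolute_error

# greater_is_better=False flips the sign so that "higher is better" holds,
# which is also why loss-based scores show up as negative numbers
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
grid = sklearn.model_selection.GridSearchCV(estimator=model, param_grid=paramGrid,
                                            scoring=mae_scorer, n_jobs=1, cv=5)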

scikit-learn linear regression K fold cross validation

I want to run linear regression with K-fold cross validation using sklearn on my training data to obtain the best regression model. I then plan to use the predictor with the lowest mean error on my test set.
For example, the piece of code below gives me an array of 20 results with different negative mean absolute errors; I am interested in finding the predictor that gives the least error and then using that predictor on my test set.
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
There is no such thing as the "predictor which gives the least error" in cross_val_score; all the estimators in
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
are identical, they only differ in the folds they are trained and scored on.
You may wish to check GridSearchCV that will indeed search through different sets of hyperparams and return the best estimator:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
X,y = datasets.make_regression()
lr_model = LinearRegression()
parameters = {'fit_intercept': [True, False]}  # note: the old 'normalize' parameter has been removed from LinearRegression
clf = GridSearchCV(lr_model, parameters, refit=True, cv=5)
best_model = clf.fit(X,y)
Note the refit=True parameter (the default), which ensures the best model is refit on the whole dataset and returned.
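A short usage sketch to close the loop with the original goal; X_test and y_test are hypothetical here, since the example above only generates X and y:
# best_model is the fitted GridSearchCV object; predict/score delegate to best_estimator_
print(best_model.best_params_)
y_pred = best_model.predict(X_test)
print(best_model.score(X_test, y_test))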

Different result roc_auc_score and plot_roc_curve

I am training a RandomForestClassifier (sklearn) to predict credit card fraud. When I then test the model and check the ROC AUC score, I get different values from roc_auc_score and plot_roc_curve: roc_auc_score gives me around 0.89, while plot_roc_curve calculates the AUC as 0.96. Why is that?
The labels are all 0 and 1, and the predictions are 0 or 1 as well.
Code:
clf = RandomForestClassifier(random_state =42)
clf.fit(X_train, y_train[target].values)
pred_test = clf.predict(X_test)
print(roc_auc_score(y_test, pred_test))
clf_disp = plot_roc_curve(clf, X_test, y_test)
plt.show()
Output of the code (the roc_auc_score is printed just above the graph).
You are feeding the prediction classes instead of the prediction probabilities to roc_auc_score.
From Documentation:
y_score: array-like of shape (n_samples,) or (n_samples, n_classes)
Target scores. In the binary and multilabel cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers).
change your code to:
clf = RandomForestClassifier(random_state =42)
clf.fit(X_train, y_train[target].values)
y_score = clf.predict_proba(X_test)
print(roc_auc_score(y_test, y_score[:, 1]))
The ROC curve and roc_auc_score take the prediction probabilities (or decision scores) as input, but as I can see from your code you are providing the hard prediction labels. plot_roc_curve obtains the probabilities from the classifier internally, which is why its AUC differs from the one you computed. You need to fix that.
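A small sketch of the difference, reusing clf, X_test and y_test from the question's code:
from sklearn.metrics import roc_auc_score

# AUC computed from hard 0/1 predictions (what the original code did): usually lower
print(roc_auc_score(y_test, clf.predict(X_test)))
# AUC computed from the probability of the positive class: matches the plotted curve
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))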

Fitting in nested cross-validation with cross_val_score with pipeline and GridSearch

I am working with scikit-learn and I am trying to tune my XGBoost model.
I am attempting nested cross-validation: a pipeline that rescales the training folds (to avoid data leakage and overfitting), GridSearchCV for parameter tuning, and cross_val_score to get the roc_auc score at the end.
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
std_scaling = StandardScaler()
algo = XGBClassifier()
steps = [('std_scaling', std_scaling), ('algo', algo)]
pipeline = Pipeline(steps)
parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}
cv1 = RepeatedKFold(n_splits=2, n_repeats = 5, random_state = 15)
clf_auc = GridSearchCV(pipeline, cv = cv1, param_grid = parameters, scoring = 'roc_auc', n_jobs=-1, return_train_score=False)
cv1 = RepeatedKFold(n_splits=2, n_repeats = 5, random_state = 15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv = cv1, scoring = 'roc_auc')
Question 1.
How do I fit cross_val_score to the training data?
Question 2.
Since I included the StandardScaler() in the pipeline does it make sense to include the X_train in the cross_val_score or should I use a standardized form of the X_train (i.e. std_X_train)?
std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)
You chose the right way to avoid data leakage as you say - nested CV.
The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existing "meta-estimator" that also describes your model selection process.
Meaning - in every round of the outer cross validation (in your case represented by cross_val_score), the estimator clf_auc undergoes internal CV which selects the best model under the given fold of the external CV.
Therefore, for every fold of the external CV you are scoring a different estimator chosen by the internal CV.
For example, in one external CV fold the model scored can be one that selected the param algo__min_child_weight to be 1, and in another a model that selected it to be 2.
The score of the external CV therefore represents a higher-level quantity: "under this process of reasonable model selection, how well will my selected model generalize?"
Now, if you want to finish the process with a real model in hand you would have to select it in some way (cross_val_score will not do that for you).
The way to do that is to now fit your internal model over the entire training data, meaning to perform:
clf_auc.fit(X, y)
This is the moment to understand what you've done here:
You have a model you can use, which is fitted over all the data available.
When you're asked "how well does that model generalize on new data?" the answer is the score you got during your nested CV, which captured the model selection process as part of your model's scoring.
And regarding Question #2 - if the scaler is part of the pipeline, there is no reason to manipulate the X_train externally.
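Putting that together, a short sketch assuming the clf_auc, outer_clf_auc, X_train and y_train objects from the question:
# generalization estimate of the whole selection procedure (from the outer CV)
print("Nested CV ROC AUC: %.3f +/- %.3f" % (outer_clf_auc.mean(), outer_clf_auc.std()))

# final model: run the inner grid search once over all the training data
clf_auc.fit(X_train, y_train)
final_model = clf_auc.best_estimator_   # fitted pipeline (scaler + XGBClassifier)
print(clf_auc.best_params_)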

Resources