Hyperparameter tuning on a pipeline object - scikit-learn

I have this pipeline:
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ("selector", get_numeric_data),
            ])),
            ('text_features', Pipeline([
                ("selector", get_text_data),
                ("vectorizer", HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                 non_negative=True, norm=None,
                                                 binary=False, ngram_range=(1, 2))),
                ('dim_red', SelectKBest(chi2, chi_k))
            ]))
        ])),
    ("clf", LogisticRegression())
])
When I try to do
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

c_space = np.logspace(-5, 8, 15)
param_grid = {"C": c_space, "penalty": ['l1', 'l2']}
logreg_cv = GridSearchCV(pl, param_grid=param_grid, cv=5)
logreg_cv.fit(X_train, y_train)
it throws:
ValueError: Invalid parameter penalty for estimator
Pipeline(memory=None,
    steps=[('union', FeatureUnion(n_jobs=1,
        transformer_list=[('numeric_features', Pipeline(memory=None,
            steps=[('selector', FunctionTransformer(accept_sparse=False,
                func=<function ... at 0x00000190ECB49488>, inv_kw_args=None,
                inverse_func=None, kw_args=None, pass_y=...ty='l2', random_state=None,
                solver='liblinear', tol=0.0001,
                verbose=0, warm_start=False))]). Check the list of available parameters
with estimator.get_params().keys().
Although "C" and "penalty" legit parameters in this case. Please help me hoe to go about it.

"C" and "penalty" are legit parameters of LogisticRegression, not Pipeline object that you send to GridSearchCV.
Your pipeline currently have two components, "union" and "clf". Now the pipeline dont know which part to send the paramters. You need to append these names used in pipeline with params, so that it can identify them and send them to correct object.
Do this:
param_grid = {"clf__C": c_space,"clf__penalty": ['l1', 'l2']}
Note that there are two underscores in between the name of object in pipeline and the parameters.
It's mentioned in the documentation of Pipeline and FeatureUnion:
    Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax.
There are various examples there to demonstrate the usage.
Following this, if you want to, say, change the ngram_range of the HashingVectorizer, you would do this:
"union__text_features__vectorizer__ngram_range": [(1, 3)]

Related

Why does sklearn.model_selection.GridSearchCV not have a consistent result?

I changed the code from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html a little bit, so it looks like this:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [10, 20, 15, 4]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
clf.best_params_
Then the result is:
{'C': 10, 'kernel': 'rbf'}
But if I change the code to:
parameters = {'kernel': ('linear', 'rbf'), 'C': [4, 10, 20, 15]}
You can see that the only change is the order of the C list. But the result is:
{'C': 4, 'kernel': 'rbf'}
It looks like GridSearchCV just uses the first parameter combination.
So I have a few questions about this:
In this case, scoring is the default (None), so which scoring function is actually used here? And why does the situation above happen?
As far as I know, when we use LatentDirichletAllocation with GridSearchCV, the scoring function is the log-likelihood even when scoring=None. If I understand correctly, GridSearchCV can automatically pick a scoring function depending on the model it is given?
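A quick way to see what is happening (a sketch, assuming the iris setup above): with scoring=None, GridSearchCV falls back to the estimator's own score method, which for SVC is mean accuracy. If several C values tie on mean test score, they all share rank 1, and best_params_ simply reports the first of the tied candidates in grid order; that is why reordering the C list changes the reported winner. Inspecting cv_results_ makes the tie visible:
import pandas as pd

# Tied candidates all get rank_test_score == 1; best_params_ corresponds
# to the first of them in the order the grid was enumerated.
results = pd.DataFrame(clf.cv_results_)
print(results[['param_C', 'param_kernel', 'mean_test_score', 'rank_test_score']])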

Using GridSearchCV with xgbranker

I am trying to use GridSearchCV with the xgbranker estimator from xgboost. I am trying to use GroupKFold and pass the qid (group ids) parameter to the grid's fit method, but it's not straightforward. After a bit of trial and error with solutions already suggested on the web, I finally zeroed in on an approach. I am still getting an error, which seems to come from the scoring method passed. Any help or a working example would be great.
Sample code:
from sklearn.model_selection import GroupKFold, GridSearchCV
from sklearn.metrics import make_scorer, ndcg_score

ndcg_scorer = make_scorer(ndcg_score)
param_grid = {
    'learning_rate': [0.001, 0.01, 0.02],
    'n_estimators': [10, 50]
}
splits = 3
gkf = GroupKFold(n_splits=splits)
cv_group = gkf.split(X_train, y_train, qids_train)

def group_gen():
    for ids, _ in cv_group:
        yield ids

grid = GridSearchCV(my_model, param_grid, cv=splits, scoring=ndcg_scorer, refit=False)
grid.fit(X_train, y_train, qid=next(group_gen()))
I get below error:
ValueError: Only ('multilabel-indicator', 'continuous-multioutput', 'multiclass-multioutput') formats are supported. Got multiclass instead
The error seems to be related to the scoring method you use, but since you didn't share anything about your data, it's hard to say what exactly the problem is.
It seems to me that the scoring method you're using expects something other than what you're providing as labels.
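A hedged sketch of one possible workaround (the wrapper ndcg_flat is my own name, not from the thread): ndcg_score expects 2D arrays of shape (n_queries, n_docs), while the scorer hands it the fold's flat 1D labels, which is what triggers the "multiclass" complaint. Reshaping, at the cost of treating the whole fold as a single query, could look like this:
import numpy as np
from sklearn.metrics import make_scorer, ndcg_score

def ndcg_flat(y_true, y_pred):
    # ndcg_score wants shape (n_queries, n_docs); this treats the whole
    # CV fold as one query, which ignores the real qid group boundaries.
    return ndcg_score(np.asarray(y_true).reshape(1, -1),
                      np.asarray(y_pred).reshape(1, -1))

ndcg_scorer = make_scorer(ndcg_flat)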

Scikit learn GridSearchCV with pipeline with custom transformer

I'm trying to perform a GridSearchCV on a pipeline with a custom transformer. The transformer enriches the features "year" and "odometer" polynomially and one-hot encodes the rest of the features. The ML model is a simple linear regression model.
custom transformer code:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures

class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=['year', 'odometer']):
        self.degree_ = degree
        self.poly_features_ = poly_features

    def fit(self, X, y=None):
        # Return the classifier
        return self

    def transform(self, X, y=None):
        poly_feat = PolynomialFeatures(degree=self.degree_)
        OneHot = OneHotEncoder(sparse=False)
        not_poly_features = list(set(X.columns) - set(self.poly_features_))
        poly = poly_feat.fit_transform(X[self.poly_features_].to_numpy())
        poly = np.hstack([poly, OneHot.fit_transform(X[not_poly_features].to_numpy())])
        return poly

    def get_params(self, deep=True):
        return {"degree": self.degree_, "poly_features": self.poly_features_}
pipeline & gridsearch code:
#create pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

poly_pipeline = Pipeline(steps=[("cpf", custom_poly_features()),
                                ("lin_reg", LinearRegression(n_jobs=-1))])

#perform gridsearch
from sklearn.model_selection import GridSearchCV

param_grid = {"cpf__degree": [3, 4, 5]}
search = GridSearchCV(poly_pipeline, param_grid, n_jobs=-1, cv=3)
search.fit(X_train_ordinal, y_train)
The custom transformer itself works fine, and the pipeline also works (the score is not great, but that is not the topic here).
poly_pipeline.fit(X_train, y_train).score(X_test, y_test)
Output:
0.543546844381771
However, when I perform the gridsearch, the scores are all nan values:
search.cv_results_
Output:
{'mean_fit_time': array([4.46928191, 4.58259885, 4.55605125]),
'std_fit_time': array([0.18111937, 0.03305779, 0.02080789]),
'mean_score_time': array([0.21119197, 0.13816587, 0.11357466]),
'std_score_time': array([0.09206233, 0.02171508, 0.02127906]),
'param_custom_poly_features__degree': masked_array(data=[3, 4, 5],
mask=[False, False, False],
fill_value='?',
dtype=object),
'params': [{'custom_poly_features__degree': 3},
{'custom_poly_features__degree': 4},
{'custom_poly_features__degree': 5}],
'split0_test_score': array([nan, nan, nan]),
'split1_test_score': array([nan, nan, nan]),
'split2_test_score': array([nan, nan, nan]),
'mean_test_score': array([nan, nan, nan]),
'std_test_score': array([nan, nan, nan]),
'rank_test_score': array([1, 2, 3])}
Does anyone know what the problem is? The transformer and the pipeline work fine on their own after all.
To debug searches in general, set error_score='raise' so that you get a full error traceback.
Your issue appears to be data-dependent; I can run this just fine on a custom dataset. That suggests to me that the comment by @Sanjar Adylov not only highlights an important issue, but is the issue for your data: the train folds sometimes contain different values in some categorical feature(s) than the test folds, so the one-hot encodings end up with different numbers of features, and the linear model justifiably breaks.
So the fix there is also as Sanjar says: instantiate, store as attributes, and fit the two transformers in your fit method, and use their transform methods in your transform method.
You will find there is another big issue: all the scores in cv_results_ are the same. This is because you can't actually set the hyperparameters correctly: in __init__ you've used mismatched names (degree as the parameter but degree_ as the attribute). Read more in the developer guide. (I think you could get around this by editing set_params similarly to how you edited get_params, but it would be much easier to rely on the BaseEstimator versions of both and just match the parameter names to the attribute names.)
Also, note that setting a parameter default to a mutable list can have surprising effects; consider alternatives to the default of poly_features in __init__ (see the sketch after the corrected class below).
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=['year', 'odometer']):
        self.degree = degree
        self.poly_features = poly_features

    def fit(self, X, y=None):
        # Instantiate, store, and fit the transformers here, not in transform.
        self.poly_feat = PolynomialFeatures(degree=self.degree)
        self.onehot = OneHotEncoder(sparse=False)
        self.not_poly_features_ = list(set(X.columns) - set(self.poly_features))
        self.poly_feat.fit(X[self.poly_features])
        self.onehot.fit(X[self.not_poly_features_])
        return self

    def transform(self, X, y=None):
        poly = self.poly_feat.transform(X[self.poly_features])
        poly = np.hstack([poly, self.onehot.transform(X[self.not_poly_features_])])
        return poly
There are some additional things you might want to add, like checks for whether poly_features or not_poly_features_ is empty (which would break the corresponding transformer).
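On the mutable-default point, one common sklearn-style pattern (my sketch, not from the original answer; it reuses the imports above) is to default to None in __init__ and resolve it in fit, so that get_params/set_params round-trip the parameter unchanged:
class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=None):
        # Store parameters untouched; the None default is resolved in fit.
        self.degree = degree
        self.poly_features = poly_features

    def fit(self, X, y=None):
        # Fitted state goes into trailing-underscore attributes.
        self.poly_features_ = (self.poly_features if self.poly_features is not None
                               else ['year', 'odometer'])
        self.not_poly_features_ = list(set(X.columns) - set(self.poly_features_))
        self.poly_feat_ = PolynomialFeatures(degree=self.degree).fit(X[self.poly_features_])
        self.onehot_ = OneHotEncoder(sparse=False).fit(X[self.not_poly_features_])
        return self

    def transform(self, X, y=None):
        poly = self.poly_feat_.transform(X[self.poly_features_])
        return np.hstack([poly, self.onehot_.transform(X[self.not_poly_features_])])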
Finally, your custom estimator is just doing what a ColumnTransformer is meant to do. I think the only reason to prefer yours is if you need to search over which columns get which treatment; I don't think that's easy to do with a ColumnTransformer.
custom_poly = ColumnTransformer(
    transformers=[('poly', PolynomialFeatures(), ['year', 'odometer'])],
    remainder=OneHotEncoder(),
)
param_grid = {"cpf__poly__degree": [3, 4, 5]}

How to combine a pipeline for all types of features with pipelines for categorical and numerical features in one ColumnTransformer?

I'm trying to create a pipeline that combines:
a pipeline for all kinds of features, no matter the type (cleaning incorrect data by feature),
a pipeline for categorical features (categorical imputer), and
a pipeline for numerical features (numerical imputer)
in a sklearn.compose.ColumnTransformer.
Here is a piece of code showing what I'm trying to do:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

alltypes = Pipeline([
    ('column_name_normalizer', ColumnNameNormalizer()),
    ('column_incorrect_data_cleaner', ColumnIncorrectDataCleaner(some_parameter)),
])
num_pipeline = Pipeline([
    ('imputer', CustomNumImputer(some_parameter)),  # fill missing values
])
cat_pipeline = Pipeline([
    ("cat", CustomCatImputer(some_parameter))
])
full_pipeline = ColumnTransformer([
    ("alltypes", alltypes, allcolumns),
    ("num", num_pipeline, numfeat),
    ("cat", cat_pipeline, catfeat)
])
try:
    X = pd.DataFrame(full_pipeline.fit_transform(X).toarray())
except AttributeError:
    X = pd.DataFrame(full_pipeline.fit_transform(X))
However, I end up with a dataframe that has more features than at the beginning, because the outputs of the three pipelines are concatenated rather than combined into a union over the same columns.
For instance, I want to do some transformations on all features, then some on categorical features, and some on numerical features, but I want the resulting dataframe to always keep the same size.
Do you know how I can fix this?
You need to combine these transformers with the sequential power of Pipeline, e.g.
cat_num_split = ColumnTransformer([
    ("num", num_pipeline, numfeat),
    ("cat", cat_pipeline, catfeat),
])
full_pipeline = Pipeline([
    ("alltypes", alltypes),
    ("cat_num", cat_num_split),
])
There is a catch here: the alltypes transformer will produce a numpy array with no information about which column is which, so the feature lists numfeat and catfeat in cat_num_split will have to rely on your knowledge of the column order and cannot use column names.
An alternative, that doesn't run into the feature name issue, is to switch the order here.
num_full_pipe = Pipeline([
    ("common", alltypes),
    ("num", num_pipeline),
])
cat_full_pipe = Pipeline([
    ("common", alltypes),
    ("cat", cat_pipeline),
])
full_pipeline = ColumnTransformer([
    ("num", num_full_pipe, numfeat),
    ("cat", cat_full_pipe, catfeat),
])
See also Consistent ColumnTransformer for intersecting lists of columns.

best-found PCA estimator to be used as the estimator in RFECV

This works (mostly from the demo sample at sklearn):
print(__doc__)
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform

lregress = LinearRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('regress', lregress)])

# Plot the PCA spectrum
pca.fit(data_num)
plt.figure(1, figsize=(16, 9))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')

# Prediction
n_components = uniform.rvs(loc=1, scale=data_num.shape[1], size=50,
                           random_state=42).astype(int)

# Parameters of pipelines can be set using '__' separated parameter names:
estimator_pca = GridSearchCV(pipe,
                             dict(pca__n_components=n_components))
estimator_pca.fit(data_num, data_labels)

plt.axvline(estimator_pca.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen ' +
            str(estimator_pca.best_estimator_.named_steps['pca'].n_components))
plt.legend(prop=dict(size=12))
plt.plot(np.cumsum(pca.explained_variance_ratio_), linewidth=1)
plt.show()
And this works:
from sklearn.feature_selection import RFECV
estimator = LinearRegression()
selector = RFECV(estimator, step=1, cv=5, scoring='explained_variance')
selector = selector.fit(data_num_pd, data_labels)
print("Selected number of features : %d" % selector.n_features_)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector.grid_scores_) + 1), selector.grid_scores_)
plt.show()
but this gives me the error RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes on the line selector1 = selector1.fit(...):
pca_est = estimator_pca.best_estimator_
selector1 = RFECV(pca_est, step=1, cv=5, scoring='explained_variance')
selector1 = selector1.fit(data_num_pd, data_labels)
print("Selected number of features : %d" % selector1.n_features_)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector1.grid_scores_) + 1), selector1.grid_scores_)
plt.show()
How do I get my best-found PCA estimator to be used as the estimator in RFECV?
This is a known issue in pipeline design. Refer to the discussion on the scikit-learn GitHub page:
Accessing fitted attributes:
Moreover, some fitted attributes are used by meta-estimators;
AdaBoostClassifier assumes its sub-estimator has a classes_ attribute
after fitting, which means that presently Pipeline cannot be used as
the sub-estimator of AdaBoostClassifier.
Either meta-estimators such as AdaBoostClassifier need to be
configurable in how they access this attribute, or meta-estimators
such as Pipeline need to make some fitted attributes of sub-estimators
accessible.
The same goes for other attributes like coef_ and feature_importances_: they are part of the last estimator, so they are not exposed by the pipeline.
Now you can try to follow the last paragraph there and circumvent this, so the pipeline can be used as a sub-estimator, by doing something like this:
class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_
Then use this new pipeline class in your code instead of the original Pipeline, for instance:
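A minimal sketch, reusing the names from the question's first snippet:
pipe = Mypipeline(steps=[('pca', pca), ('regress', lregress)])
estimator_pca = GridSearchCV(pipe, dict(pca__n_components=n_components))
estimator_pca.fit(data_num, data_labels)
# best_estimator_ is now a Mypipeline exposing coef_, so RFECV's attribute
# check passes; see the caveat below about PCA inside RFECV, though.
selector1 = RFECV(estimator_pca.best_estimator_, step=1, cv=5,
                  scoring='explained_variance')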
This should work in most cases, but not in yours. You are doing feature reduction using PCA inside the pipeline, but you want to do feature selection using RFECV on top of it. In my opinion this is not a good combination.
RFECV will keep decreasing the number of features to be used, but the n_components of the best PCA selected by the grid search above is fixed. RFECV will then throw an error when the number of features becomes less than n_components, and you cannot do anything about that.
So I would advise you to think over your use case and code.
