I'm trying to perform a GridSearchCV on a pipeline with a custom transformer. The transformer enriches the features "year" and "odometer" polynomially and one-hot encodes the rest of the features. The ML model is a simple linear regression model.
custom transformer code:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree = 2, poly_features = ['year', 'odometer']):
        self.degree_ = degree
        self.poly_features_ = poly_features

    def fit(self, X, y=None):
        # Return the classifier
        return self

    def transform(self, X, y=None):
        poly_feat = PolynomialFeatures(degree=self.degree_)
        OneHot = OneHotEncoder(sparse=False)
        not_poly_features = list(set(X.columns) - set(self.poly_features_))
        poly = poly_feat.fit_transform(X[self.poly_features_].to_numpy())
        poly = np.hstack([poly, OneHot.fit_transform(X[not_poly_features].to_numpy())])
        return poly

    def get_params(self, deep=True):
        return {"degree": self.degree_, "poly_features": self.poly_features_}
pipeline & gridsearch code:
#create pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
poly_pipeline = Pipeline(steps=[("cpf", custom_poly_features()), ("lin_reg", LinearRegression(n_jobs=-1))])
#perform gridsearch
from sklearn.model_selection import GridSearchCV
param_grid = {"cpf__degree": [3, 4, 5]}
search = GridSearchCV(poly_pipeline, param_grid, n_jobs=-1, cv=3)
search.fit(X_train_ordinal, y_train)
The custom transformer itself works fine and the pipeline also works (although the score is not great, but that is not the topic here).
poly_pipeline.fit(X_train, y_train).score(X_test, y_test)
Output:
0.543546844381771
However, when I perform the gridsearch, the scores are all nan values:
search.cv_results_
Output:
{'mean_fit_time': array([4.46928191, 4.58259885, 4.55605125]),
'std_fit_time': array([0.18111937, 0.03305779, 0.02080789]),
'mean_score_time': array([0.21119197, 0.13816587, 0.11357466]),
'std_score_time': array([0.09206233, 0.02171508, 0.02127906]),
'param_cpf__degree': masked_array(data=[3, 4, 5],
             mask=[False, False, False],
       fill_value='?',
            dtype=object),
'params': [{'cpf__degree': 3},
           {'cpf__degree': 4},
           {'cpf__degree': 5}],
'split0_test_score': array([nan, nan, nan]),
'split1_test_score': array([nan, nan, nan]),
'split2_test_score': array([nan, nan, nan]),
'mean_test_score': array([nan, nan, nan]),
'std_test_score': array([nan, nan, nan]),
'rank_test_score': array([1, 2, 3])}
Does anyone know what the problem is? The transformer and the pipeline work fine on their own after all.
To debug searches in general, set error_score='raise', so that you get a full error traceback.
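For example, a minimal sketch using the names from your snippet:
search = GridSearchCV(poly_pipeline, param_grid, n_jobs=-1, cv=3,
                      error_score='raise')
search.fit(X_train_ordinal, y_train)  # raises the underlying exception instead of recording nan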
Your issue appears to be data-dependent; I can run this just fine on a custom dataset. That suggests to me that the comment by @Sanjar Adylov not only highlights an important issue, but is the issue for your data: the train folds sometimes contain different values in some categorical feature(s) than the test folds, so the one-hot encodings end up with different numbers of features, and the linear model justifiably breaks.
So the fix is also as Sanjar says: instantiate and fit the two transformers in your fit method, store them as attributes, and use their transform methods in your transform method.
You will find there is another big issue: all the scores in cv_results_ are the same. This is because you can't actually set the hyperparameters correctly, because in __init__ you've used mismatching names (degree as the parameter but degree_ as the attribute). Read more in the developer guide. (I think you can get around this by editing set_params similar to how you edited get_params, but it would be much easier to actually rely on the BaseEstimator versions of those and just match the parameter names to the attribute names.)
Also, note that setting a parameter default to a list can have surprising effects. Consider alternatives to the default of poly_features in __init__.
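One such alternative, as a sketch (not required for the fix): default to None and resolve it at fit time, so no list object is shared between instances:
class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=None):
        self.degree = degree
        self.poly_features = poly_features  # store as given; resolve the None later

    def fit(self, X, y=None):
        # resolve the default at fit time instead of sharing one list between instances
        self.poly_features_list_ = (self.poly_features if self.poly_features is not None
                                    else ['year', 'odometer'])
        return self
The corrected class below keeps the original list default for brevity: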
class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=['year', 'odometer']):
        self.degree = degree
        self.poly_features = poly_features

    def fit(self, X, y=None):
        self.poly_feat = PolynomialFeatures(degree=self.degree)
        # handle_unknown='ignore' keeps transform from failing if a fold's test
        # data contains a category the training folds never saw
        self.onehot = OneHotEncoder(sparse=False, handle_unknown='ignore')
        self.not_poly_features_ = list(set(X.columns) - set(self.poly_features))
        self.poly_feat.fit(X[self.poly_features])
        self.onehot.fit(X[self.not_poly_features_])
        return self

    def transform(self, X, y=None):
        poly = self.poly_feat.transform(X[self.poly_features])
        poly = np.hstack([poly, self.onehot.transform(X[self.not_poly_features_])])
        return poly
There are some additional things you might want to add, like checks for whether poly_features or not_poly_features_ is empty (which would break the corresponding transformer).
Finally, your custom estimator is just doing what a ColumnTransformer is meant to do. I think the only reason to prefer yours is if you need to search over which columns get which treatment; I don't think that's easy to do with a ColumnTransformer.
from sklearn.compose import ColumnTransformer

custom_poly = ColumnTransformer(
    transformers=[('poly', PolynomialFeatures(), ['year', 'odometer'])],
    remainder=OneHotEncoder(),
)
param_grid = {"cpf__poly__degree": [3, 4, 5]}
I changed the code from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html a little bit, so it looks like this:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [10, 20, 15, 4]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
clf.best_params_
Then the result is:
{'C': 10, 'kernel': 'rbf'}
But if I change the code to:
parameters = {'kernel': ('linear', 'rbf'), 'C': [4, 10, 20, 15]}
You can see the only change is the order of the C list. But the result is:
{'C': 4, 'kernel': 'rbf'}
It looks like GridSearchCV just uses the first parameter combination.
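To dig into this, I can compare the mean scores of all candidates rather than only best_params_ (a quick check):
import numpy as np

scores = clf.cv_results_['mean_test_score']
print(scores)             # mean CV score per parameter combination
print(np.argmax(scores))  # index of the candidate reported as best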
So I have a few questions about this:
In this case, scoring is the default (None), so what scoring function is actually used here? And why does the above situation happen?
As far as I know, when we use LatentDirichletAllocation with GridSearchCV, the scoring function is log-likelihood even when scoring=None. If I understand correctly, does that mean GridSearchCV can automatically pick a scoring function depending on the model it is combined with?
According to the RandomizedSearchCV documentation (emphasis mine):
param_distributions: dict or list of dicts
Dictionary with parameters names (str) as keys and distributions or
lists of parameters to try. Distributions must provide a rvs method
for sampling (such as those from scipy.stats.distributions). If a list
is given, it is sampled uniformly. If a list of dicts is given, first
a dict is sampled uniformly, and then a parameter is sampled using
that dict as above.
If my understanding of the above is correct, both algorithms (XGBClassifier and LogisticRegression) in the following example should be sampled with high probability (>99%), given n_iter = 10: each dict has a 50% chance per draw, so the chance of never drawing one of them in 10 draws is 0.5^10 ≈ 0.1%.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
param_grid = [
    {'scaler': [StandardScaler()],
     'feature_selection': [RFE(estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss'))],
     'feature_selection__n_features_to_select': [3],
     'classification': [XGBClassifier(use_label_encoder=False, eval_metric='logloss')],
     'classification__n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
     'classification__max_depth': [2, 5, 10],
     },
    {'scaler': [StandardScaler()],
     'feature_selection': [RFE(estimator=LogisticRegression())],
     'feature_selection__n_features_to_select': [3],
     'classification': [LogisticRegression()],
     'classification__C': [0.1],
     },
]
pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('feature_selection', RFE(estimator=LogisticRegression())),
                       ('classification', LogisticRegression())])
classifier = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid,
                                scoring='neg_brier_score', n_jobs=-1, verbose=10)
data = load_breast_cancer()
X = data.data
y = data.target.ravel()
classifier.fit(X, y)
What happens though is that every time I run it, XGBClassifier gets chosen 10/10 times. I would expect one candidate to come from LogisticRegression, since the probability for each dict to be sampled is 50-50.
If the search space between the two algorithms is more balanced ('classification__n_estimators': [100]), then the sampling works as expected.
Can someone clarify what's going on here?
Yes, this is incorrect behavior. There's an Issue filed: when all the entries are lists (none are scipy distributions), the current code selects points from the ParameterGrid, which means it will disproportionately choose points from the larger dictionary-grid from your list.
Until a fix gets merged, you might be able to work around this by using a scipy distribution for something you don't care about, say for verbose?
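A sketch of that workaround, assuming it is acceptable to "search" over the Pipeline's verbose parameter (scipy's randint(0, 1) always draws 0, so the pipelines behave exactly as before):
from scipy.stats import randint

for d in param_grid:
    # a distribution (anything with an .rvs method) forces ParameterSampler to pick
    # a dict uniformly at random first, instead of sampling from the flattened grid
    d['verbose'] = randint(0, 1)  # always samples 0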
This works (mostly taken from the sklearn demo example):
print(__doc__)
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform

lregress = LinearRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('regress', lregress)])
# Plot the PCA spectrum
pca.fit(data_num)
plt.figure(1, figsize=(16, 9))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
# Prediction
n_components = uniform.rvs(loc=1, scale=data_num.shape[1], size=50,
                           random_state=42).astype(int)
# Parameters of pipelines can be set using '__' separated parameter names:
estimator_pca = GridSearchCV(pipe,
                             dict(pca__n_components=n_components))
estimator_pca.fit(data_num, data_labels)
plt.axvline(estimator_pca.best_estimator_.named_steps['pca'].n_components,
linestyle=':', label='n_components chosen ' +
str(estimator_pca.best_estimator_.named_steps['pca'].n_components))
plt.legend(prop=dict(size=12))
plt.plot(np.cumsum(pca.explained_variance_ratio_), linewidth=1)
plt.show()
And this works:
from sklearn.feature_selection import RFECV
estimator = LinearRegression()
selector = RFECV(estimator, step=1, cv=5, scoring='explained_variance')
selector = selector.fit(data_num_pd, data_labels)
print("Selected number of features : %d" % selector.n_features_)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector.grid_scores_) + 1), selector.grid_scores_)
plt.show()
but this gives me the error RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes on the line selector1 = selector1.fit:
pca_est = estimator_pca.best_estimator_
selector1 = RFECV(pca_est, step=1, cv=5, scoring='explained_variance')
selector1 = selector1.fit(data_num_pd, data_labels)
print("Selected number of features : %d" % selector1.n_features_)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector1.grid_scores_) + 1), selector1.grid_scores_)
plt.show()
How do I get my best-found PCA estimator to be used as the estimator in RFECV?
This is a known issue in pipeline design. Refer to the GitHub page here:
Accessing fitted attributes:
Moreover, some fitted attributes are used by meta-estimators;
AdaBoostClassifier assumes its sub-estimator has a classes_ attribute
after fitting, which means that presently Pipeline cannot be used as
the sub-estimator of AdaBoostClassifier.
Either meta-estimators such as AdaBoostClassifier need to be
configurable in how they access this attribute, or meta-estimators
such as Pipeline need to make some fitted attributes of sub-estimators
accessible.
The same goes for other attributes like coef_ and feature_importances_: they are part of the last estimator, so they are not exposed by the pipeline.
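They are still reachable by hand through the pipeline's steps; the problem is only that meta-estimators don't look there. A quick sketch using the step names from the question:
fitted_pipe = estimator_pca.best_estimator_     # the fitted Pipeline
final_est = fitted_pipe.named_steps['regress']  # or fitted_pipe[-1] in recent sklearn versions
print(final_est.coef_)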
Now you can try to follow the last paragraph there and circumvent this, to include them in the pipeline, by doing something like this:
class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_
And then use this new pipeline class in your code instead of the original Pipeline.
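For example, reusing the objects from the question (a sketch; only the pipeline class changes):
pipe = Mypipeline(steps=[('pca', pca), ('regress', lregress)])
estimator_pca = GridSearchCV(pipe, dict(pca__n_components=n_components))
estimator_pca.fit(data_num, data_labels)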
This should work in most cases, but not yours. You are doing feature reduction using PCA inside the pipeline, but you want to do feature selection using RFECV. In my opinion, this is not a good combination.
RFECV will keep decreasing the number of features to be used, but the n_components in your best selected PCA from the grid search above is fixed. It will then throw an error when the number of features becomes less than n_components. You cannot do anything in that case.
So I would advise you to think over your use case and code.
I am using scikit-learn with Pipeline and FeatureUnion to extract features from different inputs. Each sample (instance) in my dataset refers to documents with different lengths. My goal is to compute the top tfidf for each document independently, but I keep getting this error message:
ValueError: blocks[0,:] has incompatible row dimensions. Got
blocks[0,1].shape[0] == 1, expected 2000.
2000 is the size of the training data.
This is the main code:
book_summary = Pipeline([
    ('selector', ItemSelector(key='book')),
    ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True,
                              stop_words=my_stopword_list, sublinear_tf=True))
])
book_contents = Pipeline([('selector3', book_content_count())])
ppl = Pipeline([
    ('feats', FeatureUnion([
        ('book_summary', book_summary),
        ('book_contents', book_contents)])),
    ('clf', SVC(kernel='linear', class_weight='balanced'))  # classifier with cross fold 5
])
I wrote two classes to handle each pipeline function. My problem is with the book_contents pipeline, which mainly deals with each sample and returns a TF-IDF matrix for each book independently.
class book_content_count():
    def count_contents2(self, bookid):
        book = open('C:/TheCorpus/'+str(int(bookid))+'_book.csv', 'r')
        book_data = pd.read_csv(book, header=0, delimiter=',', encoding='latin1',
                                error_bad_lines=False, dtype=str)
        corpus = (str([book_data['text']]).strip('[]'))
        return corpus

    def transform(self, data_dict, y=None):
        data_dict['bookid']  # from here take the name
        text = data_dict['bookid'].apply(self.count_contents2)
        vec_pipe = Pipeline([('vec', TfidfVectorizer(min_df=1, lowercase=False, ngram_range=(1, 1),
                                                     use_idf=True, stop_words='english'))])
        Xtr = vec_pipe.fit_transform(text)
        return Xtr

    def fit(self, x, y=None):
        return self
Sample of data (example):
title                          Summary                           bookid
The beauty and the beast       is a traditional fairy tale...    10
ocean at the end of the lane   is a 2013 novel by British        11
Then each id will refer to a text file with the actual contents of these books
I have tried the toarray and reshape functions, but with no luck. Any idea how to solve this issue?
Thanks
You can use Neuraxle's FeatureUnion with a custom joiner that you would need to code yourself. The joiner is a class passed to Neuraxle's FeatureUnion to merge results together in the way you expect.
1. Import Neuraxle's classes.
from neuraxle.base import NonFittableMixin, BaseStep
from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper
from neuraxle.union import FeatureUnion
2. Define your custom class by inheriting from BaseStep:
class BookContentCount(BaseStep):
    def transform(self, data_dict, y=None):
        transformed = do_things(...)  # be sure to use SKLearnWrapper if you wrap sklearn items.
        return transformed

    def fit(self, x, y=None):
        return self
3. Create a joiner to join the results of the feature union the way you wish:
class CustomJoiner(NonFittableMixin, BaseStep):
    def __init__(self):
        BaseStep.__init__(self)
        NonFittableMixin.__init__(self)

    # def fit: is inherited from `NonFittableMixin` and simply returns self.

    def transform(self, data_inputs):
        # TODO: insert your own concatenation method here.
        result = np.concatenate(data_inputs, axis=-1)
        return result
4. Finally create your pipeline by passing the joiner to the FeatureUnion:
book_summary = Pipeline([
    ItemSelector(key='book'),
    TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True,
                    stop_words=my_stopword_list, sublinear_tf=True)
])

p = Pipeline([
    FeatureUnion([
        book_summary,
        BookContentCount()
    ],
        joiner=CustomJoiner()
    ),
    SVC(kernel='linear', class_weight='balanced')
])
Note: if you want your Neuraxle pipeline to become a scikit-learn pipeline back, you can do p = p.tosklearn().
To learn more on Neuraxle:
https://github.com/Neuraxio/Neuraxle
More examples from the documentation:
https://www.neuraxle.org/stable/examples/index.html
I've been learning and practicing the sklearn library on my own. When I participated in Kaggle competitions, I noticed that the provided sample code used BaseEstimator from sklearn.base.
I don't quite understand how/why BaseEstimator is used.
from sklearn.base import BaseEstimator
class FeatureMapper:
    def __init__(self, features):
        self.features = features  # features contains feature_name, column_name, and extractor (which is CountVectorizer)

    def fit(self, X, y=None):
        for feature_name, column_name, extractor in self.features:
            extractor.fit(X[column_name], y)  # my question is: is X features? if yes, where is it assigned? or else how can X call column_name by X[column_name].
    ...
This is what I usually see on sklearn's tutorial page:
from sklearn import SomeClassifier
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
Y = [0, 1, 2, 3]
clf = SomeClassifier()
clf = clf.fit(X, Y)
I couldn't find a good example or any documentation on sklearn's official page. I found the sklearn.base code on GitHub, but I'd like some examples and an explanation of how it is used.
UPDATE
Here is the link for the sample code: https://github.com/benhamner/JobSalaryPrediction/blob/master/features.py
Correction: I just realized BaseEstimator is used for the class SimpleTransform. I guess my first question is: why is it needed? (It's not used anywhere in the computation.) The other question is: when we define fit, what is X, and how is it assigned? Because usually I see:
def mymethod(self, X, y=None):
    X = self.features
    # then do something to X[Column_name]
BaseEstimator provides, among other things, a default implementation of the get_params and set_params methods (see the source code). This is useful for making the model grid-searchable with GridSearchCV for automated parameter tuning, and for making it behave well with others when combined in a Pipeline.
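A minimal sketch (the class and parameter names here are illustrative, not the exact ones from the Kaggle repo):
from sklearn.base import BaseEstimator, TransformerMixin

class SimpleTransform(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.5):
        self.threshold = threshold   # attribute name must match the __init__ argument

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X                     # identity transform, just for illustration

t = SimpleTransform(threshold=0.1)
print(t.get_params())                # {'threshold': 0.1} -- inherited from BaseEstimator
t.set_params(threshold=0.9)          # this is what GridSearchCV calls when tuning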