I changed the code from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html a little bit, so that it looks like this:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear','rbf'), 'C':[10,20, 15, 4]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
clf.best_params_
Then the result is:
{'C': 10, 'kernel': 'rbf'}
But if I change the code to:
parameters = {'kernel':('linear','rbf'), 'C':[4, 10,20, 15]}
You can see that the only change is the order of the C list. But the result is:
{'C': 4, 'kernel': 'rbf'}
It looks like GridSearchCV just uses the first parameter combination.
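One way to check whether the candidates are actually tied is to look at the per-candidate mean scores, e.g.:
import pandas as pd
# show the mean cross-validated score for every parameter combination
print(pd.DataFrame(clf.cv_results_)[['param_C', 'param_kernel', 'mean_test_score']])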
So I have a few questions about this:
In this case, scoring is the default (None), so which scoring function is actually used here? And why does the situation above happen?
As far as I know, when we use LatentDirichletAllocation with GridSearchCV, the scoring function is the log-likelihood even when scoring=None. If I understand that correctly, does it mean GridSearchCV can automatically pick a scoring function depending on the model it is combined with?
Related
I am trying to use GridSearchCV with the XGBRanker estimator from xgboost. I want to use GroupKFold and pass the qid (group_ids) parameter to the grid's fit method, but it's not straightforward. After a bit of hit and trial with solutions already suggested on the web, I finally zeroed in on an approach. I am still getting an error, which seems to be related to the scoring method passed. Any help or a working example would be great.
Sample code:
from sklearn.model_selection import GroupKFold, GridSearchCV
from sklearn.metrics import make_scorer, ndcg_score
ndcg_scorer = make_scorer(ndcg_score)
param_grid = {
    'learning_rate': [0.001, 0.01, 0.02],
    'n_estimators': [10, 50]
}
splits = 3
gkf = GroupKFold(n_splits=splits)
cv_group = gkf.split(X_train, y_train, qids_train)
def group_gen():
    for ids, _ in cv_group:
        yield ids
grid = GridSearchCV(my_model, param_grid, cv=splits, scoring=ndcg_scorer, refit=False)
grid.fit(X_train, y_train, qid=next(group_gen()))
I get the error below:
ValueError: Only ('multilabel-indicator', 'continuous-multioutput', 'multiclass-multioutput') formats are supported. Got multiclass instead
The error seems to be related to the scoring method you use, but you didn't share anything about your data, so it's hard to say what exactly the problem is.
It seems to me that the scoring method you're using expects something other than what you're providing as labels.
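For what it's worth, sklearn's ndcg_score expects 2D arrays of relevance scores (one row per query/group), not a 1D array of labels, which matches the "Got multiclass instead" message. A minimal illustration with made-up values:
import numpy as np
from sklearn.metrics import ndcg_score
y_true = np.asarray([[1, 0, 2]])         # graded relevance of 3 documents in one query
y_score = np.asarray([[0.1, 0.2, 0.9]])  # predicted scores for the same documents
print(ndcg_score(y_true, y_score))       # fine: 2D input
# ndcg_score(np.asarray([1, 0, 2]), np.asarray([0.1, 0.2, 0.9])) would raise the same
# ValueError, because a 1D integer target is treated as plain multiclass labels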
I have the following way to create the grid_cv_object, where hyperparam_grid = {"C": c, "kernel": kernel, "gamma": gamma, "degree": degree}.
grid_cv_object = GridSearchCV(
    estimator=SVC(cache_size=cache_size),
    param_grid=hyperparam_grid,
    cv=cv_splits,
    scoring=make_scorer(matthews_corrcoef),  # a callable returning a single value; binary and multiclass labels are supported
    n_jobs=-1,   # use all processors
    verbose=10,
    refit=refit
)
Here kernel can be ('rbf', 'linear', 'poly') for example.
How can I enforce the selection of LinearSVC for the 'linear' kernel? Since this is embedded in hyperparam_grid I'm not sure how to create this sort of "switch".
I just don't want to have 2 separate grid_cv_objects if possible.
Try making parameter grids in the following form
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC

search_spaces = [
    {'svm': [SVC(kernel='rbf')],
     'svm__gamma': ('scale', 'auto'),
     'svm__C': (0.1, 1.0, 10.0)},
    {'svm': [SVC(kernel='poly')],
     'svm__degree': (2, 3),
     'svm__C': (0.1, 1.0, 10.0)},
    {'svm': [LinearSVC()],  # linear kernel
     'svm__C': (0.1, 1.0, 10.0)}
]
svm_pipe = Pipeline([('svm', DummyClassifier())])
grid = GridSearchCV(svm_pipe, search_spaces)
Discussion:
We separate the different kernels into different instances of SVC. This way GridSearchCV will not evaluate, say, SVC(kernel='rbf') with different degrees, a parameter that is ignored for 'rbf' and only matters for 'poly'.
As you requested, LinearSVC (and in fact any other estimator could be slotted in the same way), rather than SVC(kernel='linear'), is used as the candidate for the linear SVM.
Best estimator will be grid.best_estimator_.named_steps['svm'].
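A quick usage sketch on a toy dataset (iris here, purely for illustration):
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
grid.fit(X, y)
print(grid.best_params_)                        # winning kernel family and its hyperparameters
print(grid.best_estimator_.named_steps['svm'])  # the fitted SVC/LinearSVC instance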
I'm trying to perform a GridSearchCV on a pipeline with a custom transformer. The transformer enriches the features "year" and "odometer" polynomially and one-hot encodes the rest of the features. The ML model is a simple linear regression model.
custom transformer code:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=['year', 'odometer']):
        self.degree_ = degree
        self.poly_features_ = poly_features

    def fit(self, X, y=None):
        # Return the classifier
        return self

    def transform(self, X, y=None):
        poly_feat = PolynomialFeatures(degree=self.degree_)
        OneHot = OneHotEncoder(sparse=False)
        not_poly_features = list(set(X.columns) - set(self.poly_features_))
        poly = poly_feat.fit_transform(X[self.poly_features_].to_numpy())
        poly = np.hstack([poly, OneHot.fit_transform(X[not_poly_features].to_numpy())])
        return poly

    def get_params(self, deep=True):
        return {"degree": self.degree_, "poly_features": self.poly_features_}
pipeline & gridsearch code:
#create pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
poly_pipeline = Pipeline(steps=[("cpf", custom_poly_features()), ("lin_reg", LinearRegression(n_jobs=-1))])
#perform gridsearch
from sklearn.model_selection import GridSearchCV
param_grid = {"cpf__degree": [3, 4, 5]}
search = GridSearchCV(poly_pipeline, param_grid, n_jobs=-1, cv=3)
search.fit(X_train_ordinal, y_train)
The custom transformer itself works fine, and the pipeline also works (the score is not great, but that is not the topic here).
poly_pipeline.fit(X_train, y_train).score(X_test, y_test)
Output:
0.543546844381771
However, when I perform the gridsearch, the scores are all nan values:
search.cv_results_
Output:
{'mean_fit_time': array([4.46928191, 4.58259885, 4.55605125]),
'std_fit_time': array([0.18111937, 0.03305779, 0.02080789]),
'mean_score_time': array([0.21119197, 0.13816587, 0.11357466]),
'std_score_time': array([0.09206233, 0.02171508, 0.02127906]),
'param_custom_poly_features__degree': masked_array(data=[3, 4, 5],
mask=[False, False, False],
fill_value='?',
dtype=object),
'params': [{'custom_poly_features__degree': 3},
{'custom_poly_features__degree': 4},
{'custom_poly_features__degree': 5}],
'split0_test_score': array([nan, nan, nan]),
'split1_test_score': array([nan, nan, nan]),
'split2_test_score': array([nan, nan, nan]),
'mean_test_score': array([nan, nan, nan]),
'std_test_score': array([nan, nan, nan]),
'rank_test_score': array([1, 2, 3])}
Does anyone know what the problem is? The transformer and the pipeline work fine on their own after all.
To debug searches in general, set error_score='raise', so that you get a full error traceback.
Your issue appears to be data-dependent; I can run this just fine on a custom dataset. That suggests to me that the comment by @Sanjar Adylov not only highlights an important issue but is the issue for your data: the train folds sometimes contain different values in some categorical feature(s) than the test folds, so the one-hot encodings end up with different numbers of features, and the linear model justifiably breaks.
So the fix is also as Sanjar says: instantiate the two transformers, store them as attributes, and fit them in your fit method, then use their transform methods in your transform method.
You will find there is another big issue: all the scores in cv_results_ are the same. This is because you can't actually set the hyperparameters correctly: in __init__ you've used mismatched names (degree as the parameter but degree_ as the attribute). Read more in the developer guide. (I think you can get around this by editing set_params similarly to how you edited get_params, but it would be much easier to rely on the BaseEstimator versions of both and just match the parameter names to the attribute names.)
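A quick way to see the mismatch (a small diagnostic sketch; set_params here is the inherited BaseEstimator version):
cpf = custom_poly_features()
cpf.set_params(degree=5)   # set_params does setattr(self, 'degree', 5)
print(cpf.degree_)         # still 2: transform reads self.degree_, which never changes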
Also, note that setting a parameter default to a list can have surprising effects. Consider alternatives to the default of poly_features in __init__.
class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=['year', 'odometer']):
        self.degree = degree
        self.poly_features = poly_features

    def fit(self, X, y=None):
        self.poly_feat = PolynomialFeatures(degree=self.degree)
        self.onehot = OneHotEncoder(sparse=False)
        self.not_poly_features_ = list(set(X.columns) - set(self.poly_features))
        self.poly_feat.fit(X[self.poly_features])
        self.onehot.fit(X[self.not_poly_features_])
        return self

    def transform(self, X, y=None):
        poly = self.poly_feat.transform(X[self.poly_features])
        poly = np.hstack([poly, self.onehot.transform(X[self.not_poly_features_])])
        return poly
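On the mutable-default point, a common idiom (just a sketch of the default handling, with a hypothetical class name) is to default poly_features to None and resolve it at fit time:
class custom_poly_features_v2(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=None):
        self.degree = degree
        self.poly_features = poly_features

    def fit(self, X, y=None):
        # resolve the None default into a fitted attribute instead of
        # sharing one mutable list object across all instances
        self.poly_features_used_ = (list(self.poly_features)
                                    if self.poly_features is not None
                                    else ['year', 'odometer'])
        return self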
There are some additional things you might want to add, like checks for whether poly_features or not_poly_features_ is empty (which would break the corresponding transformer).
Finally, your custom estimator is just doing what a ColumnTransformer is meant to do. I think the only reason to prefer yours is if you need to search over which columns get which treatment; I don't think that's easy to do with a ColumnTransformer.
from sklearn.compose import ColumnTransformer

custom_poly = ColumnTransformer(
    transformers=[('poly', PolynomialFeatures(), ['year', 'odometer'])],
    remainder=OneHotEncoder(),
)
param_grid = {"cpf__poly__degree": [3, 4, 5]}
According to RandomizedSearchCV documentation (emphasis mine):
param_distributions: dict or list of dicts
Dictionary with parameters names (str) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly. If a list of dicts is given, first a dict is sampled uniformly, and then a parameter is sampled using that dict as above.
If my understanding of the above is correct, both algorithms (XGBClassifier and LogisticRegression) in the following example should be sampled with high probability (>99%), given n_iter = 10.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
param_grid = [
    {'scaler': [StandardScaler()],
     'feature_selection': [RFE(estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss'))],
     'feature_selection__n_features_to_select': [3],
     'classification': [XGBClassifier(use_label_encoder=False, eval_metric='logloss')],
     'classification__n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
     'classification__max_depth': [2, 5, 10],
     },
    {'scaler': [StandardScaler()],
     'feature_selection': [RFE(estimator=LogisticRegression())],
     'feature_selection__n_features_to_select': [3],
     'classification': [LogisticRegression()],
     'classification__C': [0.1],
     },
]
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('feature_selection', RFE(estimator=LogisticRegression())),
('classification', LogisticRegression())])
classifier = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid,
scoring='neg_brier_score', n_jobs=-1, verbose=10)
data = load_breast_cancer()
X = data.data
y = data.target.ravel()
classifier.fit(X, y)
What happens though is that every time I run it, XGBClassifier gets chosen 10/10 times. I would expect at least one candidate to come from LogisticRegression, since the probability of each dict being sampled should be 50-50.
If the search space between the two algorithms is more balanced ('classification__n_estimators': [100]), then the sampling works as expected.
Can someone clarify what's going on here?
Yes, this is incorrect behavior. There's an Issue filed: when all the entries are lists (none are scipy distributions), the current code selects points from the ParameterGrid, which means it will disproportionately choose points from the larger dictionary-grid from your list.
Until a fix gets merged, you might be able to work around this by using a scipy distribution for something you don't care about, say for verbose?
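A sketch of that workaround (my best guess at it): give one parameter in the smaller dict a scipy distribution that only ever produces the value you want, so the sampler switches to the branch that first picks a dict uniformly.
from scipy.stats import randint
# LogisticRegression's verbose defaults to 0; randint(0, 1) always draws 0,
# but its presence makes the sampler treat the grids as distributions
param_grid[1]['classification__verbose'] = randint(0, 1)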
Has anyone used any optimization models on fitted sklearn models?
What I'd like to do is fit a model on training data and then, using this model, find the combination of input values for which the model predicts the largest value.
Some example, simplified code:
import pandas as pd
df = pd.DataFrame({
    'temperature': [10, 15, 30, 20, 25, 30],
    'working_hours': [10, 12, 12, 10, 30, 15],
    'sales': [4, 7, 6, 7.3, 10, 8]
})
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X = df.drop(['sales'], axis=1)
y = df['sales']
model.fit(X, y);
Our baseline is a simple loop that predicts over all combinations of the variables:
import numpy as np

results = pd.DataFrame(columns=['temperature', 'working_hours', 'sales_predicted'])
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        results = pd.concat([
            results,
            pd.DataFrame({
                'temperature': temp,
                'working_hours': work_hours,
                'sales_predicted': model.predict(np.array([temp, work_hours]).reshape(1, -1))
            })
        ])
print(results.sort_values(by='sales_predicted', ascending=False))
Done this way, it's difficult or impossible to:
* do it fast (brute-force method)
* implement constraints involving dependencies between two or more variables
We tried the PuLP and Pyomo libraries, but neither allows using model.predict as the objective function; we get the error:
TypeError: float() argument must be a string or a number, not 'LpVariable'
Does anyone have an idea how we can get rid of the loop and use some other approach?
When people talk about optimizing fitted sklearn models, they usually mean maximizing accuracy/performance metrics. So if you are trying to maximize your predicted value, you can definitely improve your code to achieve it more efficiently, like below.
You are collecting all the predictions in a big results dataframe and then sorting it. Instead, you can just track the running maximum of your target variable (sales_predicted) on the fly with a simple if check. So just change your loop to this:
max_sales_predicted = 0
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        sales_predicted = model.predict(np.array([temp, work_hours]).reshape(1, -1))
        if sales_predicted > max_sales_predicted:
            max_sales_predicted = sales_predicted
            desired_temp = temp
            desired_work_hours = work_hours
This way you only record a specification when it produces a prediction that exceeds the current maximum, and otherwise do nothing.
The result of my code is the same as yours, i.e. a max_sales_predicted value of 9.2. Also, desired_temp and desired_work_hours now give you the specification that produces that maximum. Hope this helps.
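If you also want to drop the grid loop entirely, one gradient-free option worth trying is scipy.optimize.differential_evolution (a rough sketch; the bounds are chosen to match the ranges in your loop). Note that a random forest's prediction surface is piecewise constant, so a derivative-free, population-based search is a reasonable fit, although results can vary between runs:
import numpy as np
from scipy.optimize import differential_evolution

def negative_sales(x):
    # differential_evolution minimizes, so return the negated prediction
    return -model.predict(np.array(x).reshape(1, -1))[0]

bounds = [(1, 100), (1, 60)]  # (temperature, working_hours), same ranges as the loop
result = differential_evolution(negative_sales, bounds, seed=0)
print(result.x, -result.fun)  # best inputs found and the corresponding predicted sales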