Suppose I have this Pipeline object:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('my_transform', my_transform()),
    ('estimator', SVC())
])
To pass the hyperparameters to my Support Vector Classifier (SVC) I could do something like this:
pipe_parameters = {
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ['rbf']
}
Then, I could use GridSearchCV:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters)
grid.fit(X_train, y_train)
We know that a linear kernel does not use gamma as a hyperparameter. So, how could I include the linear kernel in this GridSearch?
For example, in a simple grid search (without a Pipeline) I could do:
param_grid = [
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'kernel': ['rbf']},
    {'C': [0.1, 1, 10, 100, 1000],
     'kernel': ['linear']},
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'degree': [2, 3],
     'kernel': ['poly']}
]
grid = GridSearchCV(SVC(), param_grid)
Therefore, I need a working version of this sort of code:
pipe_parameters = {
'bag_of_words__max_features': (None, 1500),
'estimator__kernel': (rbf),
'estimator__gamma': (0.1, 1),
'estimator__kernel': (linear),
'estimator__C': (0.1, 1),
}
Meaning that I want to use as hyperparameters the following combinations:
kernel = rbf, gamma = 0.1
kernel = rbf, gamma = 1
kernel = linear, C = 0.1
kernel = linear, C = 1
You are almost there. Similar to how you created multiple dictionaries for the SVC model, create a list of dictionaries for the pipeline.
Try this example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
categories = [
'alt.atheism',
'talk.religion.misc',
'comp.graphics',
'sci.space',
]
remove = ('headers', 'footers', 'quotes')
data_train = fetch_20newsgroups(subset='train', categories=categories,
shuffle=True, random_state=42,
remove=remove)
pipe = Pipeline([
('bag_of_words', CountVectorizer()),
('estimator', SVC())])
pipe_parameters = [
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1],
     'estimator__gamma': [0.0001, 1],
     'estimator__kernel': ['rbf']},
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1, 1],
     'estimator__kernel': ['linear']}
]
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters, cv=2)
grid.fit(data_train.data, data_train.target)
grid.best_params_
# {'bag_of_words__max_features': None,
# 'estimator__C': 0.1,
# 'estimator__kernel': 'linear'}
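As a quick follow-up (not part of the original answer, and assuming pandas is available), you can confirm which combinations were actually evaluated, and that the linear candidates never received a gamma, by looking at grid.cv_results_:

import pandas as pd

# One row per parameter combination that GridSearchCV actually tried;
# keys missing from a sub-grid (e.g. gamma for the linear kernel) simply
# do not appear in that row's params dict.
results = pd.DataFrame(grid.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']])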
I am more interested in optimizing my multiclass problem with the Brier score instead of accuracy. To achieve that, I am evaluating my classifiers with the results of predict_proba(), like:
import numpy as np
probs = np.array(
[ [1, 0, 0],
[0, 1, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 0, 1],
[0, 0, 1]]
)
targets = np.array(
[[0.9, 0.05, 0.05],
[0.1, 0.8, 0.1],
[0.7, 0.2, 0.1],
[0.1, 0.9, 0],
[0, 0, 1],
[0.5, 0.3, 0.2],
[0.1, 0.5, 0.4],
[0.34, 0.33, 0.33]]
)
def brier_multi(targets, probs):
return np.mean(np.sum((probs - targets) ** 2, axis=1))
brier_multi(targets, probs)
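# ≈ 0.2386 for the arrays above (lower is better)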
Is it possible to optimize a scikit-learn classifier directly during training for the multiclass Brier score instead of accuracy?
Edit:
...
pipe = Pipeline(
steps=[
("preprocessor", preprocessor),
("selector", None),
("classifier", model.get("classifier")),
]
)
def brier_multi(targets, probs):
ohe_targets = OneHotEncoder().fit_transform(targets.reshape(-1, 1))
return np.mean(np.sum(np.square(probs - ohe_targets), axis=1))
brier_multi_loss = make_scorer(
brier_multi,
greater_is_better=False,
needs_proba=True,
)
search = GridSearchCV(
estimator=pipe,
param_grid=model.get("param_grid"),
scoring=brier_multi_loss,
cv=3,
n_jobs=-1,
refit=True,
verbose=3,
)
search.fit(X_train, y_train)
...
leads to nan as the score:
/home/andreas/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py:969: UserWarning: One or more of the test scores are non-finite: [nan nan nan nan nan nan nan nan nan]
warnings.warn(
You're already aware of the scoring parameter, so you just need to wrap your brier_multi into the format expected by GridSearchCV. There's a utility for that, make_scorer:
from sklearn.metrics import make_scorer
neg_mc_brier_score = make_scorer(
brier_multi,
greater_is_better=False,
needs_proba=True,
)
GridSearchCV(..., scoring=neg_mc_brier_score)
See the User Guide and the docs for make_scorer.
Unfortunately, that won't run, because your version of the scorer expects a one-hot-encoded targets array, whereas sklearn multiclass will send y_true as a 1d array. As a hack to make sure the rest works, you can modify:
def brier_multi(targets, probs):
ohe_targets = OneHotEncoder().fit_transform(targets.reshape(-1, 1))
return np.mean(np.sum(np.square(probs - ohe_targets), axis=1))
but I would encourage you to make this more robust (what if the classes aren't just 0, 1, ..., n_classes-1?).
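For instance, here is a minimal sketch of a more robust scorer, assuming you know the full label set up front (the make_brier_multi helper below is illustrative, not from the original code, and on newer scikit-learn versions make_scorer takes response_method='predict_proba' instead of needs_proba=True): fixing the classes via label_binarize makes every CV fold one-hot encode against the same columns that predict_proba reports.

import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import make_scorer

def make_brier_multi(classes):
    # classes: the full, sorted label set (matching clf.classes_), so the
    # one-hot columns line up with the columns of predict_proba.
    def brier_multi(y_true, y_prob):
        ohe = label_binarize(y_true, classes=classes)  # dense (n_samples, n_classes) for 3+ classes
        return np.mean(np.sum((y_prob - ohe) ** 2, axis=1))
    return brier_multi

neg_mc_brier_score = make_scorer(
    make_brier_multi(classes=np.sort(np.unique(y_train))),  # y_train from your own code
    greater_is_better=False,
    needs_proba=True,
)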
For what it's worth, sklearn has a PR in progress to add multiclass Brier score: https://github.com/scikit-learn/scikit-learn/pull/22046 (be sure to see the linked PR18699, as it has the beginning of development and review).
I am trying to apply RandomizedSearchCV to a RegressorChain XGBoost model, but I get an error: Invalid parameter learning_rate for estimator RegressorChain(base_estimator=XGBRegressor...).
If I comment out all the values in the grid dict, it works; otherwise it doesn't accept any param.
The same models (XGBRegressor and RegressorChain) work fine on their own; RandomizedSearchCV is just not accepting the params in the grid dict.
# Setup the parameters grid
grid = {
'n_estimators': [100, 500, 1000],
'max_depth': [5, 10, 20, 30],
'max_features': ["auto", "sqrt"],
'eta': [0.09, 0.1, 0.2],
'booster': ["dart", "gblinear"]
}
clf = XGBRegressor(objective='reg:squarederror')
chain = RegressorChain(base_estimator=clf, order=[0, 1, 2, 3, 4,5])
# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=chain,
param_distributions=grid,
n_iter=10, # number of models to try
cv=5,
verbose=1,
random_state=42,
refit=True)
# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train)  # 'rs' is short for random search
Since the XGBRegressor is the base_estimator of RegressorChain, the parameters of XGBRegressor become nested and must be addressed with base_estimator__xxx:
grid = {
    'base_estimator__n_estimators': [100, 500, 1000],
    'base_estimator__max_depth': [5, 10, 20, 30],
    'base_estimator__max_features': ["auto", "sqrt"],
    'base_estimator__eta': [0.09, 0.1, 0.2],
    'base_estimator__booster': ["dart", "gblinear"]
}
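If you are ever unsure of the exact prefix, get_params() lists every name the search will accept; a minimal check (a sketch re-creating the chain from the question):

from xgboost import XGBRegressor
from sklearn.multioutput import RegressorChain

chain = RegressorChain(base_estimator=XGBRegressor(objective='reg:squarederror'))

# Nested XGBRegressor parameters show up with the base_estimator__ prefix.
print([name for name in chain.get_params() if name.startswith('base_estimator__')][:5])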
I followed an example and tried to use grid search with a random forest classifier to generate roc_auc_score; however, the y_prob = model.predict_proba(X_test) I generated was a list of two arrays rather than a single array. So I was wondering what went wrong here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.metrics import roc_auc_score
X = np.random.rand(50,10)
y = np.random.permutation([1] * 25 + [0] * 25)
y= label_binarize(y, classes=[0, 1])
y= np.hstack((1-y, y))
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
index_split = sss.split(X, y)
train_index = []
test_index = []
for train_ind, test_ind in index_split:
train_index.extend(train_ind)
test_index.extend(test_ind)
data_train = X[train_index]
out_train = y[train_index]
data_test = X[test_index]
out_test = y[test_index]
rf = RandomForestClassifier()
grids = {
'n_estimators': [10, 50, 100, 200],
'max_features': ['auto', 'sqrt', 'log2'],
'criterion': ['gini', 'entropy']
}
rf_grids_searched = GridSearchCV(rf,
grids,
scoring = "roc_auc",
n_jobs = -1,
refit=True,
cv = 5,
verbose=10)
rf_grids_searched.fit(data_train, out_train)
rf_best = rf_grids_searched.best_estimator_
y_prob=rf_best.predict_proba(data_test)
print(roc_auc_score(out_test, y_prob))
my result:
[array([[0.5, 0.5],
        [0.5, 0.5],
        [0.7, 0.3],
        [0.3, 0.7],
        [0.7, 0.3],
        [0.5, 0.5],
        [0.1, 0.9],
        [0.6, 0.4],
        [0.6, 0.4],
        [0.4, 0.6]]), array([[0.5, 0.5],
        [0.5, 0.5],
        [0.3, 0.7],
        [0.7, 0.3],
        [0.3, 0.7],
        [0.5, 0.5],
        [0.9, 0.1],
        [0.4, 0.6],
        [0.4, 0.6],
        [0.6, 0.4]])]
expected results with probability of [0,1]:
array([[0.5, 0.5],
[0.5, 0.5],
[0.7, 0.3],
[0.3, 0.7],
[0.7, 0.3],
[0.5, 0.5],
[0.1, 0.9],
[0.6, 0.4],
[0.6, 0.4],
I also tried not binarizing y in the first place and then training the grid search to get the y_prob array below. Later, I binarized y_test to match the dimensions of y_prob and computed the score. I was wondering if this sequence is correct?
code:
out_test1= label_binarize(out_test, classes=[0, 1])
out_test1= np.hstack((1-out_test1, out_test1))
print(roc_auc_score(out_test1, y_prob))
array([[0.6, 0.4],
[0.5, 0.5],
[0.6, 0.4],
[0.5, 0.5],
[0.7, 0.3],
[0.3, 0.7],
[0.8, 0.2],
[0.4, 0.6],
[0.8, 0.2],
[0.4, 0.6]])
The grid search's predict_proba method is just a dispatch to the best estimator's predict_proba. And from the docstring for RandomForestClassifier.predict_proba (emphasis added):
Returns
p : ndarray of shape (n_samples, n_classes), or a list of n_outputs
such arrays if n_outputs > 1. ...
Since you've specified two outputs (two columns in y), you get predicted probabilities for each of the two classes for each of the two targets.
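If what you actually want is a single (n_samples, 2) probability array for roc_auc_score, one option, sketched here reusing the variables from the question (not part of the original answer), is to keep y as a plain 1-d label vector instead of stacking two binarized columns:

# out_train[:, 1] / out_test[:, 1] recover the original 0/1 labels from the
# hstack((1 - y, y)) encoding used above.
rf_best.fit(data_train, out_train[:, 1])
y_prob = rf_best.predict_proba(data_test)           # single array, shape (n_samples, 2)
print(roc_auc_score(out_test[:, 1], y_prob[:, 1]))  # score against the positive class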
I am trying to use scikit-learn GridSearchCV together with XGBoost XGBClassifier wrapper for my unbalanced multi-class classification problem. So far I have used a list of class weights as an input for the scale_pos_weight argument, but this does not seem to work as all my predictions are for the majority class. This is probably because in the documentation of the XGBClassifier it is mentioned that scale_pos_weight can only be used for binary classification problems.
So my question is, how can I input sample/class weights for a multi-class classification task using scikit-learn GridSearchCV?
My code is below:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight('balanced', np.unique(training_targets),
training_targets[target_label[0]])
random_state = np.random.randint(0, 1000)
parameters = {
'max_depth': [3, 4, 5],
'learning_rate': [0.1, 0.2, 0.3],
'n_estimators': [50, 100, 150],
'gamma': [0, 0.1, 0.2],
'min_child_weight': [0, 0.5, 1],
'max_delta_step': [0],
'subsample': [0.7, 0.8, 0.9, 1],
'colsample_bytree': [0.6, 0.8, 1],
'colsample_bylevel': [1],
'reg_alpha': [0, 1e-2, 1, 1e1],
'reg_lambda': [0, 1e-2, 1, 1e1],
'base_score': [0.5]
}
xgb_model = xgb.XGBClassifier(scale_pos_weight = class_weights, silent = True,
random_state = random_state)
clf = GridSearchCV(xgb_model, parameters, scoring = 'f1_micro', n_jobs = -1, cv = 5)
clf.fit(training_features, training_targets.values[:, 0])
model = clf.best_estimator_
The scale_pos_weight is only for binary classification, so it won't work for multi-class classification tasks.
For your case, it's more advisable to use the weight parameter as described here (https://xgboost.readthedocs.io/en/latest/python/python_api.html). The argument is an array in which each element represents the weight assigned to the corresponding data point.
The idea is essentially to manually assign different weights to different classes. There's no standard for how you assign the weights; it's up to you. The more weight a sample is assigned, the more it affects the objective function during training.
However, if you use the scikit-learn wrapper API, you cannot specify the weight parameter, nor can you use the DMatrix format. Thankfully, xgboost has its own cross-validation function; you can find the details here: https://xgboost.readthedocs.io/en/latest/python/python_api.html
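A minimal sketch of that route, assuming the training_features / training_targets variables from the question, labels encoded as 0..num_class-1, and multi:softprob as the objective (these are assumptions, not from the original post): the per-sample weights go into the DMatrix, and xgb.cv runs the cross-validation.

import numpy as np
import xgboost as xgb
from sklearn.utils.class_weight import compute_sample_weight

y = training_targets.values[:, 0]
weights = compute_sample_weight('balanced', y)  # one weight per row, inversely proportional to class frequency
dtrain = xgb.DMatrix(training_features, label=y, weight=weights)

params = {'objective': 'multi:softprob', 'num_class': len(np.unique(y)),
          'max_depth': 4, 'eta': 0.1}
cv_results = xgb.cv(params, dtrain, num_boost_round=100, nfold=5, metrics='mlogloss')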
I suggest that you use the compute_sample_weight() function and set the weights for each sample by looking at your labels. This solves your problem in an elegant way. See below for 3 classes (-1, 0, 1):
from sklearn.utils.class_weight import compute_sample_weight

sample_weights = compute_sample_weight({-1: 4, 0: 1, 1: 4}, Train_Labels)
random_search = RandomizedSearchCV(model, param_distributions=params, n_iter=param_comb,
                                   return_train_score=True, scoring=score, cv=ps,
                                   n_jobs=-1, verbose=3, random_state=1001)
random_search.fit(Train, Train_Labels, sample_weight=sample_weights)
In a multi-class setup we need to pass the sample_weight parameter, a list of values (weights) whose length matches the number of data points (for example, the number of rows in X_train), to fit() of XGBClassifier. Check the docs.
While using XGBClassifier with scikit-learn GridSearchCV, you can pass sample_weight directly to the fit() of GridSearchCV.
Note: tried in scikit-learn version 1.1.1. Not sure from which version onwards this is supported.
For example:
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

def get_weights(cls):
    # class labels based on your dataset.
    class_weights = {
        0: 1,
        1: 4,
        2: 1,
    }
    return [class_weights[cl] for cl in cls]
grid = {
"max_depth": [3, 4, 5, 6],
"n_estimators": range(20, 70, 10),
"learning_rate": np.arange(0.25, 0.50, 0.05),
}
xgb_clf = XGBClassifier(random_state=42, n_jobs=-1)
xgb_cvm = GridSearchCV(estimator=xgb_clf, param_grid=grid, n_jobs=-1, cv=5)
xgb_cvm.fit(X, y, sample_weight=get_weights(y))
import numpy as np
import xgboost as xgb
from mlxtend.regressor import StackingRegressor
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.ensemble import GradientBoostingRegressor as GBR
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
rfr = RFR(n_estimators=500, n_jobs=cc.ncpu, random_state=0)
gbr = GBR(n_estimators=1000, random_state=0)
xgr = xgb.XGBRegressor()
mtr = RFR() # meta regressor
regressors = [rfr, gbr, xgr]
model = StackingRegressor(regressors=regressors, meta_regressor=mtr)
param_grid = {
'fs__threshold': ['median'],
'fs__estimator__max_features': ['log2'],
'clf__rfr__max_features': ['auto', 'log2'],
'clf__gbr__learning_rate': [0.05, 0.02, 0.01],
'clf__gbr__max_depth': [4, 5, 6, 7],
'clf__gbr__max_features': ['auto', 'log2'],
'clf__gbr__n_estimators': [500, 1000, 2000],
'clf__xgr__learning_rate': [0.001, 0.05, 0.1, 0.2],
'clf__xgr__max_depth': [2, 4, 6],
'clf__xgr__min_child_weight': [1, 3, 5],
'clf__xgr__n_estimators': [500, 1000],
'clf__meta-mtr__n_estimators': [750, 1500]
}
rf_feature_imp = RFR(250, n_jobs=cc.ncpu)
feat_selection = SelectFromModel(rf_feature_imp)
pipeline = Pipeline([('fs', feat_selection), ('clf', model), ])
gs = GridSearchCV(pipeline, param_grid=param_grid, verbose=1, n_jobs=-1, error_score=np.nan)
In the code above, I want to use the mlxtend StackingRegressor and also use a random forest to select relevant features. However, this code is not working and I get an error:
ValueError: Invalid parameter xgr for estimator StackingRegressor(meta_regressor=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False),
regressors=[RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=500, n_jobs=5, oob_sc...eg:linear', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True, subsample=1)],
verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
How to fix this?