This is a follow-up to a question answered here, but I believe it deserves its own thread.
In the previous question, we were dealing with “an Ensemble of Ensemble classifiers, where each has its own parameters.” Let's start with the example provided by MaximeKan in his answer:
my_est = BaggingClassifier(RandomForestClassifier(n_estimators=100, bootstrap=True, max_features=0.5),
                           n_estimators=5, bootstrap_features=False, bootstrap=False,
                           max_features=1.0, max_samples=0.6)
Now say I want to go one level above that: considerations like efficiency, computational cost, etc., aside, and as a general concept: How would I run grid search with this kind of setup?
I can set up two parameter grids along these lines:
One for the BaggingClassifier:
BC_param_grid = {
    'bootstrap': [True, False],
    'bootstrap_features': [True, False],
    'n_estimators': [5, 10, 15],
    'max_samples': [0.6, 0.8, 1.0]
}
And one for the RandomForestClassifier:
RFC_param_grid = {
    'bootstrap': [True, False],
    'n_estimators': [100, 200, 300],
    'max_features': [0.6, 0.8, 1.0]
}
Now I can call grid search with my estimator:
grid_search = GridSearchCV(estimator = my_est, param_grid = ???)
What do I do with the param_grid parameter in this case? Or more specifically, how do I use both of the parameter grids I set up?
I have to say, it feels like I’m playing with matryoshka dolls.
Following @James Dellinger's comment above, and expanding from there, I was able to get it done. It turns out the "secret sauce" is a mostly-undocumented feature: the __ (double underscore) separator (there is some passing reference to it in the Pipeline documentation). Prefixing the name of an inside/base estimator parameter with the estimator's name followed by __ lets you build a single param_grid that covers parameters for both the outside and inside estimators.
So for the example in the question, the outside estimator is BaggingClassifier and the inside/base estimator is RandomForestClassifier. First, import what is needed:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
followed by the param_grid assignment (in this case, using the parameters from the example in the question):
param_grid = {
    'bootstrap': [True, False],
    'bootstrap_features': [True, False],
    'n_estimators': [5, 10, 15],
    'max_samples': [0.6, 0.8, 1.0],
    'base_estimator__bootstrap': [True, False],
    'base_estimator__n_estimators': [100, 200, 300],
    'base_estimator__max_features': [0.6, 0.8, 1.0]
}
And, finally, your grid search:
grid_search = GridSearchCV(BaggingClassifier(base_estimator=RandomForestClassifier()),
                           param_grid=param_grid, cv=5)
And you're off to the races.
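For completeness, here is a minimal usage sketch (X and y are placeholder names for your training data, not defined above):
grid_search.fit(X, y)
# best_params_ reports the winning combination across both levels,
# with the inner RandomForestClassifier keys prefixed by base_estimator__
print(grid_search.best_params_)
print(grid_search.best_score_)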
So I did the following:
MLP = MLPRegressor()
parameter_space = {
    'hidden_layer_sizes': [(32,), (32,16), (32,16,8), (32,16,8,4), (32,16,8,4,2),
                           (32,32), (32,32,32), (32,32,32,32), (32,32,32,32,32), (16,8,4,2)],
    'activation': ['relu'],
    'solver': ['adam'],
    'learning_rate_init': [1, 0.1, 0.01, 0.001, 0.0001, 0.00001],
    'max_iter': [5000],
    'shuffle': [True, False],
    'random_state': [0],
    'early_stopping': [True, False],
    'n_iter_no_change': [50],
}
gs_MLP = GridSearchCV(estimator=MLP, param_grid=parameter_space, cv=7, n_jobs=-1)
gs_MLP_fit = gs_MLP.fit(X, y)
gs_MLP.score(X, y)
And I noticed that whenever I change the order of the tuples within hidden_layer_sizes, it gives different answers. At first it said (16,8,4,2) was best, and when I put (16,8,4,2) at the end it said (32,32,32,32) was best.
I assume this has to do with the random_state? Do I have to set it in MLPRegressor() instead, as in MLPRegressor(random_state=0)?
I am trying to apply RandomizedSearchCV to a RegressorChain XGBoost model, but I get an error: Invalid parameter learning_rate for estimator RegressorChain(base_estimator=XGBRegressor.
If I comment out all the values in the grid dict it works; otherwise it doesn't accept any of the params.
The same models (XGBRegressor and RegressorChain) work fine on their own; RandomizedSearchCV just isn't accepting the params in the grid dict.
# Set up the parameter grid
grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [5, 10, 20, 30],
    'max_features': ["auto", "sqrt"],
    'eta': [0.09, 0.1, 0.2],
    'booster': ["dart", "gblinear"]
}
clf = XGBRegressor(objective='reg:squarederror')
chain = RegressorChain(base_estimator=clf, order=[0, 1, 2, 3, 4, 5])

# Set up RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=chain,
                            param_distributions=grid,
                            n_iter=10,  # number of models to try
                            cv=5,
                            verbose=1,
                            random_state=42,
                            refit=True)

# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train)  # 'rs' is short for random search
Since the XGBRegressor is the base_estimator of RegressorChain, the parameters of XGBRegressor become nested and must be addressed with base_estimator__xxx:
grid = {
    'base_estimator__n_estimators': [100, 500, 1000],
    'base_estimator__max_depth': [5, 10, 20, 30],
    'base_estimator__max_features': ["auto", "sqrt"],
    'base_estimator__eta': [0.09, 0.1, 0.2],
    'base_estimator__booster': ["dart", "gblinear"]
}
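If in doubt about the exact prefixes, the chain object defined in the question can list every tunable parameter name the search will accept; a quick sketch:
# Every parameter of the nested XGBRegressor shows up with the
# base_estimator__ prefix, which is what RandomizedSearchCV expects.
print(sorted(chain.get_params().keys()))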
I have multivariate time series data and want to detect anomalies with the Isolation Forest algorithm.
I want to get the best parameters from GridSearchCV; here is the code snippet of my GridSearchCV.
The input data set is loaded with the snippet below.
df = pd.read_csv("train.csv")
df.drop(['dataTimestamp', 'Anomaly'], inplace=True, axis=1)
X_train = df
y_train = df1[['Anomaly']]  # the Anomaly column is labelled data
Define the parameters for the Isolation Forest:
clf = IsolationForest(random_state=47, behaviour='new', score="accuracy")
param_grid = {'n_estimators': list(range(100, 800, 5)), 'max_samples': list(range(100, 500, 5)),
              'contamination': [0.1, 0.2, 0.3, 0.4, 0.5], 'max_features': [5, 10, 15],
              'bootstrap': [True, False], 'n_jobs': [5, 10, 20, 30]}
f1sc = make_scorer(f1_score)
grid_dt_estimator = model_selection.GridSearchCV(clf, param_grid, scoring=f1sc,
                                                 refit=True, cv=10, return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)
After executing the fit, I got the error below.
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
Can someone guide me on what this is about? I tried average='weight', but still no luck. Is there anything I'm doing wrong here?
Please also let me know how to get the F-score.
You get this error because you didn't set the average parameter when turning f1_score into a scorer. In fact, as detailed in the documentation:
average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned.
The consequence is that the scorer returns multiple scores, one for each class in your classification problem, instead of a single measure. The solution is to declare one of the possible values of the average parameter for f1_score, depending on your needs. I have therefore refactored the code you provided into a possible solution to your problem:
from sklearn.ensemble import IsolationForest
from sklearn.metrics import make_scorer, f1_score
from sklearn import model_selection
from sklearn.datasets import make_classification
X_train, y_train = make_classification(n_samples=500,
                                       n_classes=2)
clf = IsolationForest(random_state=47, behaviour='new')
param_grid = {'n_estimators': list(range(100, 800, 5)),
              'max_samples': list(range(100, 500, 5)),
              'contamination': [0.1, 0.2, 0.3, 0.4, 0.5],
              'max_features': [5, 10, 15],
              'bootstrap': [True, False],
              'n_jobs': [5, 10, 20, 30]}
f1sc = make_scorer(f1_score(average='micro'))
grid_dt_estimator = model_selection.GridSearchCV(clf,
                                                 param_grid,
                                                 scoring=f1sc,
                                                 refit=True,
                                                 cv=10,
                                                 return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)
Update make_scorer with the following to get it working:
make_scorer(f1_score, average='micro')
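To spell out the difference: make_scorer expects the metric function itself plus any keyword arguments to forward, whereas f1_score(average='micro') calls the metric immediately, without y_true and y_pred, and fails. A minimal sketch of the corrected scorer, which then plugs into the GridSearchCV call above unchanged:
from sklearn.metrics import make_scorer, f1_score

# f1_score is passed as a callable; average='micro' is stored by
# make_scorer and forwarded on every scoring call.
f1sc = make_scorer(f1_score, average='micro')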
Not all of the parameters you are tuning are necessary.
For example:
contamination is the expected rate of anomalies; you can determine the best value after fitting a model by tuning the threshold on model.score_samples (see the sketch below).
n_jobs is the number of CPU cores used; it doesn't affect model quality.
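A minimal sketch of what that threshold tuning could look like (X_train is assumed from the question; the 5% cut-off is only an illustrative choice):
import numpy as np
from sklearn.ensemble import IsolationForest

# Fit once without worrying about contamination, then choose a cut-off
# on the anomaly scores afterwards.
iso = IsolationForest(n_estimators=300, random_state=47).fit(X_train)
scores = iso.score_samples(X_train)    # lower scores = more anomalous
threshold = np.quantile(scores, 0.05)  # flag the lowest 5% as anomalies
predicted_anomaly = scores < threshold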
I am trying to use scikit-learn GridSearchCV together with the XGBoost XGBClassifier wrapper for my unbalanced multi-class classification problem. So far I have used a list of class weights as input for the scale_pos_weight argument, but this does not seem to work, as all my predictions are for the majority class. This is probably because the documentation of XGBClassifier mentions that scale_pos_weight can only be used for binary classification problems.
So my question is, how can I input sample/class weights for a multi-class classification task using scikit-learn GridSearchCV?
My code is below:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight('balanced', np.unique(training_targets),
                                     training_targets[target_label[0]])
random_state = np.random.randint(0, 1000)
parameters = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.2, 0.3],
    'n_estimators': [50, 100, 150],
    'gamma': [0, 0.1, 0.2],
    'min_child_weight': [0, 0.5, 1],
    'max_delta_step': [0],
    'subsample': [0.7, 0.8, 0.9, 1],
    'colsample_bytree': [0.6, 0.8, 1],
    'colsample_bylevel': [1],
    'reg_alpha': [0, 1e-2, 1, 1e1],
    'reg_lambda': [0, 1e-2, 1, 1e1],
    'base_score': [0.5]
}
xgb_model = xgb.XGBClassifier(scale_pos_weight=class_weights, silent=True,
                              random_state=random_state)
clf = GridSearchCV(xgb_model, parameters, scoring='f1_micro', n_jobs=-1, cv=5)
clf.fit(training_features, training_targets.values[:, 0])
model = clf.best_estimator_
The scale_pos_weight argument is only for binary classification, so it won't work on multi-class classification tasks.
For your case, it's more advisable to use the weight parameter as described here (https://xgboost.readthedocs.io/en/latest/python/python_api.html). The argument is an array in which each element represents the weight you assign to the corresponding data point.
The idea is essentially to manually assign different weights to different classes. There's no standard for how you need to assign weights; it's up to you. The more weight a sample is assigned, the more it affects the objective function during training.
However, if you use the scikit-learn API format, you cannot specify the weight parameter or use the DMatrix format. Thankfully, xgboost has its own cross-validation function, whose details you can find here: https://xgboost.readthedocs.io/en/latest/python/python_api.html
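For illustration, a minimal sketch of the DMatrix/xgb.cv route described above (y_train is assumed to hold integer class labels; the class-to-weight mapping and parameter values are placeholders, not recommendations):
import numpy as np
import xgboost as xgb

# Example per-sample weights derived from class labels: up-weight the
# minority classes (the mapping is illustrative only).
class_to_weight = {0: 1.0, 1: 4.0, 2: 4.0}
sample_weights = np.array([class_to_weight[c] for c in y_train])

dtrain = xgb.DMatrix(X_train, label=y_train, weight=sample_weights)
params = {'objective': 'multi:softprob', 'num_class': 3, 'max_depth': 4}
cv_results = xgb.cv(params, dtrain, num_boost_round=100, nfold=5,
                    metrics='mlogloss', early_stopping_rounds=10, seed=42)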
I suggest that you use the compute_sample_weight() function and set weights for each sample by looking at your labels. This will solve your problem in the most elegant way. See below for 3 classes (-1,0,1):
sample_weights = compute_sample_weight({-1: 4, 0: 1, 1: 4}, Train_Labels)
random_search = RandomizedSearchCV(model, param_distributions=params, n_iter=param_comb,
                                   return_train_score=True, scoring=score, cv=ps,
                                   n_jobs=-1, verbose=3, random_state=1001)
random_search.fit(Train, Train_Labels, sample_weight=sample_weights)
In a multi-class setup we need to pass the sample_weight parameter, with a list of values (weights) matching the count of data points (for example, the number of rows in X_train), to fit() of XGBClassifier. Check the docs.
While using XGBClassifier with scikit-learn GridSearchCV, you can pass sample_weight directly to the fit() of GridSearchCV.
Note: Tried in scikit-learn version 1.1.1. Not sure from which version onwards this is supported.
For example:
def get_weights(cls):
    class_weights = {
        # class-labels based on your dataset
        0: 1,
        1: 4,
        2: 1,
    }
    return [class_weights[cl] for cl in cls]

grid = {
    "max_depth": [3, 4, 5, 6],
    "n_estimators": range(20, 70, 10),
    "learning_rate": np.arange(0.25, 0.50, 0.05),
}
xgb_clf = XGBClassifier(random_state=42, n_jobs=-1)
xgb_cvm = GridSearchCV(estimator=xgb_clf, param_grid=grid, n_jobs=-1, cv=5)
xgb_cvm.fit(X, y, sample_weight=get_weights(y))
I'm building a grid search of multiple classifiers and want to use recursive feature elimination with cross validation. I started with the code as provided in Recursive feature elimination and grid search using scikit-learn. Below is my working code:
param_grid = [{'C': 0.001}, {'C': 0.01}, {'C': .1}, {'C': 1.0}, {'C': 10.0},
              {'C': 100.0}, {'fit_intercept': True}, {'fit_intercept': False},
              {'penalty': 'l1'}, {'penalty': 'l2'}]
estimator = LogisticRegression()
selector = RFECV(estimator, step=1, cv=5, scoring="roc_auc")
clf = grid_search.GridSearchCV(selector, {"estimator_params": param_grid},
                               cv=5, n_jobs=-1)
clf.fit(X,y)
print clf.best_estimator_.estimator_
print clf.best_estimator_.ranking_
print clf.best_estimator_.score(X, y)
I'm receiving a DeprecationWarning, as it appears the "estimator_params" parameter is being removed in 0.18; I'm trying to figure out the correct syntax to use in the GridSearchCV call.
Trying...
param_grid = [{'C': 0.001}, {'C': 0.01}, {'C': .1}, {'C': 1.0}, {'C': 10.0},
              {'C': 100.0}, {'fit_intercept': True}, {'fit_intercept': False},
              {'fit_intercept': 'l1'}, {'fit_intercept': 'l2'}]
clf = grid_search.GridSearchCV(selector, param_grid,
                               cv=5, n_jobs=-1)
Returns ValueError: Parameter values should be a list. And...
param_grid = {"penalty": ["l1","l2"],
"C": [.001,.01,.1,1,10,100],
"fit_intercept": [True, False]}
clf = grid_search.GridSearchCV(selector, param_grid,
cv=5, n_jobs=-1)
Returns ValueError: Invalid parameter penalty for estimator RFECV. Check the list of available parameters with estimator.get_params().keys(). Checking the keys shows all 3 of "C", "fit_intercept" and "penalty" as parameter keys. Trying...
param_grid = {"estimator__C": [.001,.01,.1,1,10,100],
"estimator__fit_intercept": [True, False],
"estimator__penalty": ["l1","l2"]}
clf = grid_search.GridSearchCV(selector, param_grid,
cv=5, n_jobs=-1)
This never completes execution, so I'm guessing that type of parameter assignment is not supported.
For now I'm set up to ignore the warnings, but I'd like to update the code with the appropriate syntax for 0.18. Any assistance would be appreciated!
This was answered in a question previously posted on SO: https://stackoverflow.com/a/35560648/5336341. Thanks to Paulo Alves for the answer.
Relevant code:
params = {'estimator__max_depth': [1, 5, None],
          'estimator__class_weight': ['balanced', None]}
estimator = DecisionTreeClassifier()
selector = RFECV(estimator, step=1, cv=3, scoring='accuracy')
clf = GridSearchCV(selector, params, cv=3)
clf.fit(X_train, y_train)
clf.best_estimator_.estimator_
To see more, use:
print(selector.get_params())