So I ran a fairly thorough GridSearchCV with 10-fold cross-validation inside an integrated pipeline, in the following manner:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

pipeline_rf = Pipeline([
    ('standardize', MinMaxScaler()),
    ('grid_search_lr', GridSearchCV(
        RandomForestClassifier(),
        param_grid={'bootstrap': [True],
                    'max_depth': [50, 100, 150, 200],
                    'max_features': ['auto', 'sqrt'],
                    'min_samples_leaf': [1, 2, 4],
                    'min_samples_split': [2, 5, 10],
                    'n_estimators': [100, 200, 500, 1000, 1500]},
        cv=10,
        n_jobs=-1,
        scoring='roc_auc',
        verbose=2,
        refit=True
    ))
])
pipeline_rf.fit(X_train, y_train)
How should I go about extracting the best set of parameters?
You first need to get the GridSearchCV object out of the pipeline, and then read its best_params_ attribute. You can do that via named_steps:
pipeline_rf.named_steps['grid_search_lr'].best_params_
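Because refit=True, the fitted search inside the pipeline also exposes best_score_ and a refitted best_estimator_. A minimal sketch of pulling all three out (the variable names here are just for illustration):

search = pipeline_rf.named_steps['grid_search_lr']
print(search.best_params_)        # best hyperparameter combination
print(search.best_score_)         # mean cross-validated roc_auc for that combination
best_rf = search.best_estimator_  # forest refitted on the (scaled) training data

Note that the search only ever saw MinMaxScaler-transformed data, so if you use best_estimator_ outside the pipeline, feed it scaled input.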
When I try to run a RandomForestClassifier with Pipeline and param_grid:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("rf", RandomForestClassifier())])

param_grid = {
    'max_depth': [4, 5, 10],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'n_estimators': [100, 200, 300]
}
# initialize
grid_pipeline = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1, cv=3, scoring='f1')
# fit
grid_pipeline.fit(X_train, y_train)
grid_pipeline.best_params_
I get the following error:
ValueError: Invalid parameter max_depth for estimator Pipeline(memory=None,
steps=[('scaler',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('rf',
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
class_weight=None, criterion='gini',
max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None,
oob_score=False, random_state=None,
verbose=0, warm_start=False))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Although I have reviewed the scikit-learn documentation and several posts, I can't find the error in my code.
When you use a Pipeline inside GridSearchCV, every parameter key must be prefixed with the name of the pipeline step it belongs to, separated by a double underscore. In your case:
param_grid = {
'rf__max_depth': [4, 5, 10],
'rf__max_features': [2, 3],
'rf__min_samples_leaf': [3, 4, 5],
'rf__n_estimators': [100, 200, 300]
}
Example from sklearn documentation: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
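As the error message itself suggests, you can list every valid key straight from the pipeline; a quick sketch:

# Every parameter name the grid search will accept for this pipeline;
# the forest's keys all start with 'rf__'
print(sorted(pipeline.get_params().keys()))
# ... 'rf__max_depth', 'rf__max_features', 'rf__min_samples_leaf', 'rf__n_estimators' ...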
I am trying to apply RandomizedSearchCV to a RegressorChain XGBoost model, but I get an error: Invalid parameter learning_rate for estimator RegressorChain(base_estimator=XGBRegressor.
If I comment out all the values in the grid dict, it works; otherwise it doesn't accept any param.
The same models (XGBRegressor and RegressorChain) work fine on their own; RandomizedSearchCV just won't accept the params in the grid dict.
# Set up the parameter grid
grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [5, 10, 20, 30],
    'max_features': ["auto", "sqrt"],
    'eta': [0.09, 0.1, 0.2],
    'booster': ["dart", "gblinear"]
}

clf = XGBRegressor(objective='reg:squarederror')
chain = RegressorChain(base_estimator=clf, order=[0, 1, 2, 3, 4, 5])

# Set up RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=chain,
                            param_distributions=grid,
                            n_iter=10,  # number of parameter settings to try
                            cv=5,
                            verbose=1,
                            random_state=42,
                            refit=True)

# Fit the RandomizedSearchCV version of clf ('rs' is short for randomized search)
rs_clf.fit(X_train, y_train)
Since XGBRegressor is the base_estimator of the RegressorChain, its parameters become nested and must be addressed as base_estimator__xxx:
grid = {
'base_estimator__n_estimators': [100, 500, 1000],
'base_estimator__max_depth': [5, 10, 20, 30],
'base_estimator__max_features': ["auto", "sqrt"],
'base_estimator__eta': [0.09, 0.1, 0.2],
'base_estimator__booster': ["dart", "gblinear"]
}
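If you are unsure of the exact nested names, get_params() on the wrapper lists them all; a quick check (assuming chain is the RegressorChain defined above):

# The tunable keys include the nested XGBRegressor parameters, e.g.
# 'base_estimator__n_estimators', 'base_estimator__max_depth', 'base_estimator__booster'
print(sorted(chain.get_params().keys()))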
I have defined my parameter grid and grid search here. The weird thing is, the output does not include any of the parameter options I set; e.g. max_features is still shown as 'auto'.
Have I done something wrong?
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
    'bootstrap': [True],
    'max_depth': [90, 100, 110],
    'max_features': [2, 3, 10, 20],
    'min_samples_leaf': [3, 4, 5, 10],
    'min_samples_split': [2, 5, 8, 10, 12],
    'n_estimators': [10, 20, 50, 60, 70]
}
model = RandomForestClassifier()
# Instantiate the grid search model
best = GridSearchCV(estimator=model, param_grid=param_grid,
                    cv=3, n_jobs=-1, verbose=2)
best.fit(x, y.ravel())
The search results live on the fitted GridSearchCV object. fit() returns that same object, so you can keep its return value and read the attributes from it:
fitted_grid = best.fit(x, y.ravel())
best_classifier = fitted_grid.best_estimator_
best_parameters = fitted_grid.best_params_
(Since fit() returns self, reading best.best_params_ after best.fit(...) works just as well.) Printing model only shows the unfitted estimator you passed in, with its default values; that is why you see max_features='auto' instead of the tuned settings.
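If you want more than the single best combination, the fitted search also keeps a full results table in cv_results_. A small sketch using pandas (an extra dependency, purely for display):

import pandas as pd

# One row per parameter combination, with its mean CV score and rank
results = pd.DataFrame(best.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())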
I have applied GridSearchCV with DecisionTreeClassifier, RandomForestClassifier, LogisticRegression, and XGBClassifier as estimators, and used all of them in ensemble learning.
The results that GridSearchCV gives for these estimators differ between my system and my friend's system, even though we use the same training and testing data on both. I don't know why.
What should be changed to make the search give the same result on any system?
gs_dt = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42,class_weight={1:10, 0:1}),
param_grid=[{'max_depth': [ 2, 4, 6, 8, 10],
'criterion':['gini','entropy'],
"max_features":["auto", None],
"max_leaf_nodes":[10,20,30,40]}],
scoring=scoring,
cv=10,
refit='recall')
gs_rf = GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1, oob_score = True,class_weight={1: 10/11, 0: 1/11}),
param_grid=[{'max_depth': [4, 6, 8, 10, 12, 16, 20, None],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [2, 4, 8],
'min_samples_split': [10, 20]}],
scoring=scoring,
cv=10,
n_jobs=4,
refit='recall')
gs_lr = GridSearchCV(estimator=LogisticRegression(multi_class='ovr',random_state=42,class_weight={1:10, 0:1}),
param_grid=[{'C': [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1 ,1],
'penalty':['l1','l2']}],
scoring=scoring,
cv=10,
refit='recall')
gs_gb = GridSearchCV(estimator=XGBClassifier(n_jobs=-1),
param_grid=[{'learning_rate': [0.01, 0.05, 0.1, 0.2],
'max_depth': [4, 6, 8, 10, 12, 16, 20],
'min_samples_leaf': [4, 8, 12, 16, 20],
'max_features': ['auto', 'sqrt']}],
scoring=scoring,
cv=10,
n_jobs=4,
refit='recall')
For example, the first GridSearchCV gives this result on my system:
DecisionTreeClassifier(class_weight={1: 10, 0: 1}, criterion='gini',
max_depth=8, max_features=None, max_leaf_nodes=10,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=42,
splitter='best')
And on my friend's system it gives:
DecisionTreeClassifier(class_weight={0: 1, 1: 10}, criterion='gini',
max_depth=10, max_features=None, max_leaf_nodes=10,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=42, splitter='best')
Similarly, I get different results on my system and my friend's for the other estimators.
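One likely culprit (an assumption on my part, not confirmed above): the RandomForestClassifier and XGBClassifier are constructed without a random_state, and the 10-fold CV splitter is unseeded, so tie-breaking among near-equal parameter combinations can legitimately differ across machines and library versions. A minimal sketch that pins those sources of randomness for the random forest search:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Seed the CV splits so both machines see identical folds
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

gs_rf = GridSearchCV(
    estimator=RandomForestClassifier(n_jobs=-1, oob_score=True,
                                     class_weight={1: 10/11, 0: 1/11},
                                     random_state=42),  # seed the forest itself
    param_grid=[{'max_depth': [4, 6, 8, 10, 12, 16, 20, None],
                 'max_features': ['auto', 'sqrt'],
                 'min_samples_leaf': [2, 4, 8],
                 'min_samples_split': [10, 20]}],
    scoring=scoring,  # the same scoring dict used above
    cv=cv,
    n_jobs=4,
    refit='recall')

Beyond seeding, make sure both machines run identical scikit-learn and xgboost versions, since defaults and internal behaviour can change between releases.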
I am trying to make time series predictions using XGBoost (XGBRegressor).
I used GridSearchCV like this:
parameters = {'nthread': [4],
              'objective': ['reg:linear'],
              'learning_rate': [0.01, 0.03, 0.05],
              'max_depth': [3, 4, 5, 6, 7],
              'min_child_weight': [4],
              'silent': [1],
              'subsample': [1],
              'colsample_bytree': [0.7, 0.8],
              'n_estimators': [500]}
xgb_grid = GridSearchCV(xgb, parameters, cv=2, n_jobs=5,
                        verbose=True)
xgb_grid.fit(x_train, y_train,
             eval_set=[(x_train, y_train), (x_test, y_test)],
             early_stopping_rounds=100,
             verbose=True)
print(xgb_grid.best_score_)
print(xgb_grid.best_params_)
And got these:
0.307153826086191
{'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 4, 'min_child_weight': 4, 'n_estimators': 500, 'nthread': 4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}
I tried using those parameters and calculated the error. I got this:
MSE: 4.579726929529167
MAE: 1.6753722069363144
I know that an error of 1.6 is not very good for predictions; it has to be < 0.9.
I tried to fine-tune the parameters, but I have not managed to reduce the error below that.
I found something about the date format; maybe that is the problem? My data looks like this: yyyy-MM-dd HH:mm.
I am new to machine learning, and this is what I managed to do after some examples and tutorials. What should I do to lower the error, or what should I search for to learn?
I mention that I found various examples like this one, but I didn't understand them, and of course they did not work.
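One thing worth checking (an assumption, not something stated above): with cv=2, GridSearchCV uses plain KFold splits, so the model can be validated on data that comes before its training data, which leaks future information in a time-series setting. scikit-learn's TimeSeriesSplit keeps the folds chronological. A minimal sketch reusing the parameters dict above:

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor

# Each validation fold comes strictly after the data the model was trained on
tscv = TimeSeriesSplit(n_splits=5)
xgb_grid = GridSearchCV(XGBRegressor(), parameters,
                        cv=tscv, n_jobs=5,
                        scoring='neg_mean_absolute_error',  # optimize MAE directly
                        verbose=True)
xgb_grid.fit(x_train, y_train)
print(xgb_grid.best_params_)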