GridSearchCV gives different result - python-3.x

I have applied GridSearchCV with DecisionTreeClassifier, RandomForestClassifier, LogisticRegression, and XGBClassifier as estimators, and used all of them in ensemble learning.
With the same training and test data, GridSearchCV returns different results for these estimators on my system and on my friend's system, and I don't know why.
What should be changed so that the grid search gives the same result on any machine?
gs_dt = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42, class_weight={1: 10, 0: 1}),
                     param_grid=[{'max_depth': [2, 4, 6, 8, 10],
                                  'criterion': ['gini', 'entropy'],
                                  'max_features': ['auto', None],
                                  'max_leaf_nodes': [10, 20, 30, 40]}],
                     scoring=scoring,
                     cv=10,
                     refit='recall')

gs_rf = GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1, oob_score=True, class_weight={1: 10/11, 0: 1/11}),
                     param_grid=[{'max_depth': [4, 6, 8, 10, 12, 16, 20, None],
                                  'max_features': ['auto', 'sqrt'],
                                  'min_samples_leaf': [2, 4, 8],
                                  'min_samples_split': [10, 20]}],
                     scoring=scoring,
                     cv=10,
                     n_jobs=4,
                     refit='recall')

gs_lr = GridSearchCV(estimator=LogisticRegression(multi_class='ovr', random_state=42, class_weight={1: 10, 0: 1}),
                     param_grid=[{'C': [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
                                  'penalty': ['l1', 'l2']}],
                     scoring=scoring,
                     cv=10,
                     refit='recall')

gs_gb = GridSearchCV(estimator=XGBClassifier(n_jobs=-1),
                     param_grid=[{'learning_rate': [0.01, 0.05, 0.1, 0.2],
                                  'max_depth': [4, 6, 8, 10, 12, 16, 20],
                                  'min_samples_leaf': [4, 8, 12, 16, 20],
                                  'max_features': ['auto', 'sqrt']}],
                     scoring=scoring,
                     cv=10,
                     n_jobs=4,
                     refit='recall')
For example, the first grid search gives this result on my system:
DecisionTreeClassifier(class_weight={1: 10, 0: 1}, criterion='gini',
max_depth=8, max_features=None, max_leaf_nodes=10,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=42,
splitter='best')
And on my friend's system it gives:
DecisionTreeClassifier(class_weight={0: 1, 1: 10}, criterion='gini',
max_depth=10, max_features=None, max_leaf_nodes=10,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=42, splitter='best')
The other grid searches likewise give different results on my system and on my friend's.
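One detail worth checking first: the RandomForestClassifier in gs_rf and the XGBClassifier in gs_gb above are constructed without a random_state, so their internal randomness is not pinned at all. A minimal sketch of seeding everything, shown for the random forest search; assumptions are that plain recall replaces the undefined scoring object and that both machines run the same scikit-learn/XGBoost versions (version differences can change results even with identical seeds):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Seed the estimator itself, not only the search.
rf = RandomForestClassifier(n_jobs=-1, oob_score=True,
                            class_weight={1: 10/11, 0: 1/11},
                            random_state=42)

# Pin the CV folds explicitly as well; a bare cv=10 leaves fold
# construction to the library defaults.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

gs_rf = GridSearchCV(estimator=rf,
                     param_grid=[{'max_depth': [4, 6, 8, 10, 12, 16, 20, None],
                                  'max_features': ['auto', 'sqrt'],
                                  'min_samples_leaf': [2, 4, 8],
                                  'min_samples_split': [10, 20]}],
                     scoring='recall',  # assumption: plain recall instead of the undefined `scoring`
                     cv=cv,
                     n_jobs=4,
                     refit=True)        # refit='recall' only applies with multi-metric scoring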

Related

Check the list of available parameters with `estimator.get_params().keys()`

When I try to run a RandomForestClassifier with a Pipeline and param_grid:
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("rf", RandomForestClassifier())])

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [4, 5, 10],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'n_estimators': [100, 200, 300]
}
# initialize
grid_pipeline = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1, cv=3, scoring='f1')
# fit
grid_pipeline.fit(X_train, y_train)
grid_pipeline.best_params_
I get the following error:
ValueError: Invalid parameter max_depth for estimator Pipeline(memory=None,
steps=[('scaler',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('rf',
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
class_weight=None, criterion='gini',
max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None,
oob_score=False, random_state=None,
verbose=0, warm_start=False))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Although I have reviewed the scikit-learn documentation and several posts, I can't find the error in my code.
When you use a pipeline with GridSearchCV() you must include the step names in the parameter keys: separate the step name from the parameter name with a double underscore. In your case:
param_grid = {
    'rf__max_depth': [4, 5, 10],
    'rf__max_features': [2, 3],
    'rf__min_samples_leaf': [3, 4, 5],
    'rf__n_estimators': [100, 200, 300]
}
Example from sklearn documentation: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
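If you are unsure what the valid prefixed names are, the error message itself points the way: get_params() lists every reachable parameter. A quick check, using the pipeline defined above:

# Prints keys such as 'rf__max_depth' and 'scaler__with_mean'
# alongside the step names themselves.
print(sorted(pipeline.get_params().keys()))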

Why is hyperopt giving a best loss of NaN while optimizing a Random Forest?

I am solving a Kaggle problem: https://www.kaggle.com/c/forest-cover-type-prediction/data
I used hyperopt to find optimal hyperparameters for a Random Forest, but I am stuck: for almost every iteration it reports best loss: nan.
My full code:
import pandas as pd
import numpy as np

# Load the dataset
train = pd.read_csv(r"D:\Study Material\Py_Programs\Data Sets\forest-cover-type-prediction\train.csv")
test = pd.read_csv(r"D:\Study Material\Py_Programs\Data Sets\forest-cover-type-prediction\test.csv")

# Append train and test so we can study everything together
# (not including test_2 for a while)
test['Cover_Type'] = np.nan
data = train.append(test, ignore_index=True)
del train, test

# Feature engineering: rather than doing it manually, use Featuretools
# First create a simple new attribute that can later be the index for the Soil entity
data['Id_Soil'] = np.arange(len(data))

import featuretools as ft
es = ft.EntitySet(id='forest')
es = es.entity_from_dataframe(entity_id='Forest_Pred', dataframe=data, index='Id')
>>> es
Entityset: forest
Entities:
Forest_Pred [Rows: 581012, Columns: 63]
Relationships:
No relationships
# Make a separate entity for Soil
Additional_Variable = data.columns[data.columns.str.startswith('Soil')]
Additional_Variable
es = es.normalize_entity(base_entity_id='Forest_Pred', new_entity_id='Soil',
                         index='Id_Soil',
                         additional_variables=list(Additional_Variable))
>>> es
Entityset: forest
Entities:
Forest_Pred [Rows: 581012, Columns: 23]
Soil [Rows: 581012, Columns: 41]
Relationships:
Forest_Pred.Id_Soil -> Soil.Id_Soil
# Run DFS
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='Forest_Pred')

# Drop every generated feature whose name contains the target, except the target itself
drop_cols = []
for col in feature_matrix:
    if col == 'Cover_Type':
        pass
    else:
        if 'Cover_Type' in col:
            drop_cols.append(col)
feature_matrix = feature_matrix[[x for x in feature_matrix if x not in drop_cols]]
feature_matrix.head()
# Create correlation matrix
corr_matrix = feature_matrix.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] >= 0.95)]
print('There are {} columns with >= 0.95 correlation.'.format(len(to_drop)))
There are 83 columns with >= 0.95 correlation.
>>> to_drop
# These are the redundant columns
['New2', # I manually created New2, Hill1 and Hill3
'Hill1',
'Hill3',
'Soil.SUM(Forest_Pred.wild1)',
'Soil.SUM(Forest_Pred.Vertical_Distance_To_Hydrology)',
'Soil.SUM(Forest_Pred.Hillshade_Noon)',
'Soil.SUM(Forest_Pred.Horizontal_Distance_To_Roadways)',
'Soil.SUM(Forest_Pred.Slope)',
'Soil.SUM(Forest_Pred.Wilderness_Area4)',
'Soil.SUM(Forest_Pred.New4)',
'Soil.SUM(Forest_Pred.Hill2)',
'Soil.SUM(Forest_Pred.New2)',
'Soil.SUM(Forest_Pred.Wilderness_Area2)',
'Soil.SUM(Forest_Pred.Horizontal_Distance_To_Hydrology)',
'Soil.SUM(Forest_Pred.Hillshade_9am)',
'Soil.SUM(Forest_Pred.Aspect)',
'Soil.SUM(Forest_Pred.Hillshade_3pm)',
'Soil.SUM(Forest_Pred.Hill1)',
'Soil.SUM(Forest_Pred.Hill3)',
'Soil.SUM(Forest_Pred.Elevation)',
'Soil.SUM(Forest_Pred.Wilderness_Area3)',
'Soil.SUM(Forest_Pred.Horizontal_Distance_To_Fire_Points)',
'Soil.SUM(Forest_Pred.Wilderness_Area1)',
'Soil.MAX(Forest_Pred.wild1)',
'Soil.MAX(Forest_Pred.Vertical_Distance_To_Hydrology)',
'Soil.MAX(Forest_Pred.Hillshade_Noon)',
'Soil.MAX(Forest_Pred.Horizontal_Distance_To_Roadways)',
'Soil.MAX(Forest_Pred.Slope)',
'Soil.MAX(Forest_Pred.Wilderness_Area4)',
'Soil.MAX(Forest_Pred.New4)',
'Soil.MAX(Forest_Pred.Hill2)',
'Soil.MAX(Forest_Pred.New2)',
'Soil.MAX(Forest_Pred.Wilderness_Area2)',
'Soil.MAX(Forest_Pred.Horizontal_Distance_To_Hydrology)',
'Soil.MAX(Forest_Pred.Hillshade_9am)',
'Soil.MAX(Forest_Pred.Aspect)',
'Soil.MAX(Forest_Pred.Hillshade_3pm)',
'Soil.MAX(Forest_Pred.Hill1)',
'Soil.MAX(Forest_Pred.Hill3)',
'Soil.MAX(Forest_Pred.Elevation)',
'Soil.MAX(Forest_Pred.Wilderness_Area3)',
'Soil.MAX(Forest_Pred.Horizontal_Distance_To_Fire_Points)',
'Soil.MAX(Forest_Pred.Wilderness_Area1)',
'Soil.MIN(Forest_Pred.wild1)',
'Soil.MIN(Forest_Pred.Vertical_Distance_To_Hydrology)',
'Soil.MIN(Forest_Pred.Hillshade_Noon)',
'Soil.MIN(Forest_Pred.Horizontal_Distance_To_Roadways)',
'Soil.MIN(Forest_Pred.Slope)',
'Soil.MIN(Forest_Pred.Wilderness_Area4)',
'Soil.MIN(Forest_Pred.New4)',
'Soil.MIN(Forest_Pred.Hill2)',
'Soil.MIN(Forest_Pred.New2)',
'Soil.MIN(Forest_Pred.Wilderness_Area2)',
'Soil.MIN(Forest_Pred.Horizontal_Distance_To_Hydrology)',
'Soil.MIN(Forest_Pred.Hillshade_9am)',
'Soil.MIN(Forest_Pred.Aspect)',
'Soil.MIN(Forest_Pred.Hillshade_3pm)',
'Soil.MIN(Forest_Pred.Hill1)',
'Soil.MIN(Forest_Pred.Hill3)',
'Soil.MIN(Forest_Pred.Elevation)',
'Soil.MIN(Forest_Pred.Wilderness_Area3)',
'Soil.MIN(Forest_Pred.Horizontal_Distance_To_Fire_Points)',
'Soil.MIN(Forest_Pred.Wilderness_Area1)',
'Soil.MEAN(Forest_Pred.wild1)',
'Soil.MEAN(Forest_Pred.Vertical_Distance_To_Hydrology)',
'Soil.MEAN(Forest_Pred.Hillshade_Noon)',
'Soil.MEAN(Forest_Pred.Horizontal_Distance_To_Roadways)',
'Soil.MEAN(Forest_Pred.Slope)',
'Soil.MEAN(Forest_Pred.Wilderness_Area4)',
'Soil.MEAN(Forest_Pred.New4)',
'Soil.MEAN(Forest_Pred.Hill2)',
'Soil.MEAN(Forest_Pred.New2)',
'Soil.MEAN(Forest_Pred.Wilderness_Area2)',
'Soil.MEAN(Forest_Pred.Horizontal_Distance_To_Hydrology)',
'Soil.MEAN(Forest_Pred.Hillshade_9am)',
'Soil.MEAN(Forest_Pred.Aspect)',
'Soil.MEAN(Forest_Pred.Hillshade_3pm)',
'Soil.MEAN(Forest_Pred.Hill1)',
'Soil.MEAN(Forest_Pred.Hill3)',
'Soil.MEAN(Forest_Pred.Elevation)',
'Soil.MEAN(Forest_Pred.Wilderness_Area3)',
'Soil.MEAN(Forest_Pred.Horizontal_Distance_To_Fire_Points)',
'Soil.MEAN(Forest_Pred.Wilderness_Area1)']
# Get the features; now look at the null values
Null_Values = pd.DataFrame(train.isnull().sum()).rename(columns={0: 'Total'})
Null_Values['Percentage'] = Null_Values['Total'] / len(train)
Null_Values.sort_values('Percentage', ascending=False)
Fully_Null_Columns = Null_Values.loc[Null_Values['Percentage'] == 1.0]
To_Remove = Fully_Null_Columns.index
Feature = list(train.columns)
for Val in To_Remove:
    Feature.remove(Val)
>>> len(Feature)
58
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, MinMaxScaler  # Imputer as in older scikit-learn versions

Pipe = Pipeline([
    ('impute', Imputer(strategy='median')),
    ('scaler', MinMaxScaler())
])
train = Pipe.fit_transform(train)
test = Pipe.transform(test)
######################## Hyperopt Part Begins From Here ########################
# Apply hyperopt to optimize the two models we think may do well: Random Forest and MLP
# First do the Random Forest
# Define the objective function for it
from hyperopt import STATUS_OK

def Objective_Forest(params):
    classifier = RandomForestClassifier(**params)
    score = cross_val_score(classifier, train, Target, cv=10, scoring=scorer)
    Best_Score = 1 - np.mean(score)
    return {'loss': Best_Score, 'params': params, 'status': STATUS_OK}
# Define the parameter space for the Random Forest classifier
from hyperopt import hp

Param_grid = {
    'n_estimators': hp.choice('n_estimators', range(10, 1000)),
    'max_depth': hp.choice('max_depth', range(1, 20)),
    'min_samples_split': hp.choice('min_samples_split', range(2, 20)),
    'min_samples_leaf': hp.choice('min_samples_leaf', range(1, 11)),
    'min_weight_fraction_leaf': hp.uniform('min_weight_fraction_leaf', 0.0, 1.0),
    'max_features': hp.choice('max_features', ["sqrt", "log2", "None", 0.2, 0.5, 0.8]),
    'max_leaf_nodes': hp.choice('max_leaf_nodes', range(10, 150)),
    'min_impurity_decrease': hp.uniform('min_impurity_decrease', 0.0, 1.0),
    'class_weight': hp.choice('class_weight', [None, 'balanced']),
    'max_samples': hp.uniform('max_samples', 0.0, 1.0)
}
from hyperopt import tpe
tpe_algo = tpe.suggest
from hyperopt import Trials
bayes_trials = Trials()
from hyperopt import fmin

MAX_EVALS = 100
# Optimize
best = fmin(fn=Objective_Forest, space=Param_grid, algo=tpe_algo,
            max_evals=MAX_EVALS, trials=bayes_trials)
>>> [print(t['result'],end = '\n\n\n') for t in bayes_trials.trials]
{'loss': nan, 'params': {'class_weight': 'balanced', 'max_depth': 1, 'max_features': 'None', 'max_leaf_nodes': 33, 'max_samples': 0.4660771469206677, 'min_impurity_decrease': 0.45511437833393464, 'min_samples_leaf': 10, 'min_samples_split': 8, 'min_weight_fraction_leaf': 0.9339453161850745, 'n_estimators': 972}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': None, 'max_depth': 11, 'max_features': 'log2', 'max_leaf_nodes': 49, 'max_samples': 0.14947278280397347, 'min_impurity_decrease': 0.2358674422658822, 'min_samples_leaf': 9, 'min_samples_split': 16, 'min_weight_fraction_leaf': 0.5935700756502073, 'n_estimators': 436}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': None, 'max_depth': 16, 'max_features': 'None', 'max_leaf_nodes': 64, 'max_samples': 0.008126252217055763, 'min_impurity_decrease': 0.5860665211910298, 'min_samples_leaf': 3, 'min_samples_split': 14, 'min_weight_fraction_leaf': 0.7589621329866701, 'n_estimators': 544}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': None, 'max_depth': 13, 'max_features': 'None', 'max_leaf_nodes': 88, 'max_samples': 0.8342507642254701, 'min_impurity_decrease': 0.29169826447891134, 'min_samples_leaf': 9, 'min_samples_split': 14, 'min_weight_fraction_leaf': 0.5732868446872494, 'n_estimators': 759}, 'status': 'ok'}
{'loss': 0.514714207538852, 'params': {'class_weight': None, 'max_depth': 4, 'max_features': 'sqrt', 'max_leaf_nodes': 104, 'max_samples': 0.10435155448150135, 'min_impurity_decrease': 0.024801820935633656, 'min_samples_leaf': 5, 'min_samples_split': 10, 'min_weight_fraction_leaf': 0.09350127980207612, 'n_estimators': 739}, 'status': 'ok'}
{'loss': 0.9642857142857143, 'params': {'class_weight': 'balanced', 'max_depth': 5, 'max_features': 'log2', 'max_leaf_nodes': 86, 'max_samples': 0.029032222646389272, 'min_impurity_decrease': 0.4459819146508117, 'min_samples_leaf': 5, 'min_samples_split': 10, 'min_weight_fraction_leaf': 0.16673304793166255, 'n_estimators': 419}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': 'balanced', 'max_depth': 1, 'max_features': 'None', 'max_leaf_nodes': 18, 'max_samples': 0.4913763122828826, 'min_impurity_decrease': 0.35382231135300235, 'min_samples_leaf': 3, 'min_samples_split': 18, 'min_weight_fraction_leaf': 0.7421569901774066, 'n_estimators': 354}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': 'balanced', 'max_depth': 4, 'max_features': 'sqrt', 'max_leaf_nodes': 69, 'max_samples': 0.27201985914939086, 'min_impurity_decrease': 0.486936153640398, 'min_samples_leaf': 8, 'min_samples_split': 15, 'min_weight_fraction_leaf': 0.7310520866089266, 'n_estimators': 142}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': None, 'max_depth': 12, 'max_features': 'sqrt', 'max_leaf_nodes': 36, 'max_samples': 0.9771715541709761, 'min_impurity_decrease': 0.1971412468087903, 'min_samples_leaf': 9, 'min_samples_split': 3, 'min_weight_fraction_leaf': 0.8200016570398415, 'n_estimators': 34}, 'status': 'ok'}
{'loss': 0.9642857142857143, 'params': {'class_weight': None, 'max_depth': 10, 'max_features': 'sqrt', 'max_leaf_nodes': 73, 'max_samples': 0.45641569744506405, 'min_impurity_decrease': 0.8403030256419523, 'min_samples_leaf': 7, 'min_samples_split': 9, 'min_weight_fraction_leaf': 0.0701815156303528, 'n_estimators': 873}, 'status': 'ok'}
{'loss': 0.9642857142857143, 'params': {'class_weight': None, 'max_depth': 17, 'max_features': 'sqrt', 'max_leaf_nodes': 46, 'max_samples': 0.15866300388832533, 'min_impurity_decrease': 0.9297347852530089, 'min_samples_leaf': 7, 'min_samples_split': 6, 'min_weight_fraction_leaf': 0.18404233693328886, 'n_estimators': 121}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': 'balanced', 'max_depth': 7, 'max_features': 'None', 'max_leaf_nodes': 104, 'max_samples': 0.0367072640631847, 'min_impurity_decrease': 0.12910648344978914, 'min_samples_leaf': 2, 'min_samples_split': 15, 'min_weight_fraction_leaf': 0.3161712810846662, 'n_estimators': 767}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': 'balanced', 'max_depth': 3, 'max_features': 'None', 'max_leaf_nodes': 124, 'max_samples': 0.16440865223966705, 'min_impurity_decrease': 0.391904635576072, 'min_samples_leaf': 1, 'min_samples_split': 7, 'min_weight_fraction_leaf': 0.0811356314154057, 'n_estimators': 347}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': 'balanced', 'max_depth': 12, 'max_features': 'log2', 'max_leaf_nodes': 68, 'max_samples': 0.8502406812728349, 'min_impurity_decrease': 0.7058978690401395, 'min_samples_leaf': 2, 'min_samples_split': 16, 'min_weight_fraction_leaf': 0.7016784424128134, 'n_estimators': 938}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': 'balanced', 'max_depth': 5, 'max_features': 'log2', 'max_leaf_nodes': 99, 'max_samples': 0.23705851369580344, 'min_impurity_decrease': 0.20836965887913506, 'min_samples_leaf': 7, 'min_samples_split': 3, 'min_weight_fraction_leaf': 0.7453528956610014, 'n_estimators': 468}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': None, 'max_depth': 15, 'max_features': 'None', 'max_leaf_nodes': 114, 'max_samples': 0.7084444118326696, 'min_impurity_decrease': 0.986092424730284, 'min_samples_leaf': 3, 'min_samples_split': 14, 'min_weight_fraction_leaf': 0.30715124274867167, 'n_estimators': 743}, 'status': 'ok'}
{'loss': 0.9642857142857143, 'params': {'class_weight': 'balanced', 'max_depth': 10, 'max_features': 'sqrt', 'max_leaf_nodes': 97, 'max_samples': 0.9199683481619908, 'min_impurity_decrease': 0.34148971488668467, 'min_samples_leaf': 5, 'min_samples_split': 10, 'min_weight_fraction_leaf': 0.006984816385200432, 'n_estimators': 386}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': None, 'max_depth': 13, 'max_features': 'None', 'max_leaf_nodes': 20, 'max_samples': 0.38036460187991084, 'min_impurity_decrease': 0.8852038598514178, 'min_samples_leaf': 5, 'min_samples_split': 11, 'min_weight_fraction_leaf': 0.06166031048348186, 'n_estimators': 635}, 'status': 'ok'}
{'loss': nan, 'params': {'class_weight': 'balanced', 'max_depth': 5, 'max_features': 'None', 'max_leaf_nodes': 52, 'max_samples': 0.8640312159272309, 'min_impurity_decrease': 0.16823848137945396, 'min_samples_leaf': 1, 'min_samples_split': 9, 'min_weight_fraction_leaf': 0.24162088495434908, 'n_estimators': 564}, 'status': 'ok'}
{'status': 'new'}
[None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None]
I executed fmin for the full number of iterations, with the same result. What am I doing wrong here?
Your loss in Objective_Forest is defined as 1 - np.mean(score), and score is evaluated by cross-validation with cross_val_score. The loss therefore depends on the output of cross_val_score. You are using scorer as the evaluation metric, but you have not defined it anywhere in the code shown (it was probably defined somewhere else). Your NaN values are most likely due to the kind of scoring you are using in your cross-validation.
def Objective_Forest(params):
    classifier = RandomForestClassifier(**params)
    score = cross_val_score(classifier, train, Target, cv=10, scoring=scorer)
    Best_Score = 1 - np.mean(score)
    return {'loss': Best_Score, 'params': params, 'status': STATUS_OK}
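Whatever the scorer, it also helps to make failing folds raise instead of vanishing: with a search space like the one above, some sampled combinations are simply invalid for RandomForestClassifier (for example max_features='None' as a string, or min_weight_fraction_leaf above 0.5), and in newer scikit-learn a failed fold silently becomes NaN. A minimal sketch, assuming a scikit-learn version where cross_val_score accepts error_score, and plain accuracy standing in for the undefined scorer:

import numpy as np
from hyperopt import STATUS_OK
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def Objective_Forest(params):
    classifier = RandomForestClassifier(**params)
    # error_score='raise' surfaces the underlying exception instead of
    # letting a failed fold silently turn into NaN (newer scikit-learn
    # defaults to error_score=np.nan for cross_val_score).
    score = cross_val_score(classifier, train, Target, cv=10,
                            scoring='accuracy',  # assumption: stand-in for the undefined `scorer`
                            error_score='raise')
    return {'loss': 1 - np.mean(score), 'params': params, 'status': STATUS_OK}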

GridSearchCV returning results not in the param grid

I have defined my parameter grid and grid search here. The weird thing is that the output does not include any of the parameter options I set; e.g. max_features comes back as 'auto'.
Have I done something wrong?
from sklearn.grid_search import GridSearchCV

param_grid = {
    'bootstrap': [True],
    'max_depth': [90, 100, 110],
    'max_features': [2, 3, 10, 20],
    'min_samples_leaf': [3, 4, 5, 10],
    'min_samples_split': [2, 5, 8, 10, 12],
    'n_estimators': [10, 20, 50, 60, 70]
}
model = RandomForestClassifier()
# Instantiate the grid search model
best = GridSearchCV(estimator=model, param_grid=param_grid,
                    cv=3, n_jobs=-1, verbose=2)
best.fit(x, y.ravel())
You have to take the return value of the best.fit() call:
fitted_grid = best.fit(x, y.ravel())
best_classifier = fitted_grid.best_estimator_
best_parameters = fitted_grid.best_params_
I did not see that part in your code snippet, so maybe that's what you were missing?
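As a side note, scikit-learn's fit() returns the estimator itself, so the original best object is also fitted after the call; printing the bare, unfitted model would only show defaults such as max_features='auto'. Either handle works:

best.fit(x, y.ravel())
print(best.best_params_)     # the winning combination from param_grid
print(best.best_estimator_)  # the refit model; its repr shows the tuned values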

How to optimize xgb regression model?

I am trying to make time-series predictions using XGBoost (XGBRegressor).
I used GridSearchCV like this:
parameters = {'nthread': [4],
              'objective': ['reg:linear'],
              'learning_rate': [0.01, 0.03, 0.05],
              'max_depth': [3, 4, 5, 6, 7, 7],
              'min_child_weight': [4],
              'silent': [1],
              'subsample': [1],
              'colsample_bytree': [0.7, 0.8],
              'n_estimators': [500]}

xgb_grid = GridSearchCV(xgb, parameters, cv=2, n_jobs=5,
                        verbose=True)
xgb_grid.fit(x_train, y_train,
             eval_set=[(x_train, y_train), (x_test, y_test)],
             early_stopping_rounds=100,
             verbose=True)

print(xgb_grid.best_score_)
print(xgb_grid.best_params_)
And got this:
0.307153826086191
{'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 4, 'min_child_weight': 4, 'n_estimators': 500, 'nthread': 4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}
I tried using those parameters and calculated the error. I got this:
MSE: 4.579726929529167
MAE: 1.6753722069363144
I know that an error of 1.6 is not very good for predictions; it has to be < 0.9.
I tried to fine-tune the parameters but have not managed to reduce the error more than that.
I found something about the date format; maybe that is the problem? My data looks like this: yyyy-MM-dd HH:mm.
I am new to machine learning, and this is what I managed to do after some examples and tutorials. What should I do to lower the error, or what should I search for to learn?
I mention that I found various examples like this one, but I didn't understand them, and of course they did not work.
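There is no single fix for a high error, but one thing specific to time series stands out in the code above: cv=2 uses ordinary, unshuffled KFold splits, which let the model be validated on data from before its training window. A hedged sketch using TimeSeriesSplit instead, assuming x_train rows are in chronological order; the trimmed grid drops 'silent' and 'nthread', which are deprecated in newer xgboost:

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor

parameters = {'learning_rate': [0.01, 0.03, 0.05],
              'max_depth': [3, 5, 7],
              'colsample_bytree': [0.7, 0.8],
              'n_estimators': [500]}

# Each split trains on an initial segment of the series and validates on
# the segment immediately after it, so the model is never scored on the past.
tscv = TimeSeriesSplit(n_splits=5)

xgb_grid = GridSearchCV(XGBRegressor(objective='reg:squarederror'),  # newer spelling of 'reg:linear'
                        parameters,
                        cv=tscv,
                        scoring='neg_mean_absolute_error',  # assumption: tune directly for MAE
                        n_jobs=5,
                        verbose=True)
xgb_grid.fit(x_train, y_train)
print(xgb_grid.best_params_)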

Extract best_params after running GridSearch within a pipeline

So I ran a very thorough GridSearch with 10-fold cross-validation in an integrated pipeline, in the following manner:
pipeline_rf = Pipeline([
    ('standardize', MinMaxScaler()),
    ('grid_search_lr', GridSearchCV(
        RandomForestClassifier(),
        param_grid={'bootstrap': [True],
                    'max_depth': [50, 100, 150, 200],
                    'max_features': ['auto', 'sqrt'],
                    'min_samples_leaf': [1, 2, 4],
                    'min_samples_split': [2, 5, 10],
                    'n_estimators': [100, 200, 500, 1000, 1500]},
        cv=10,
        n_jobs=-1,
        scoring='roc_auc',
        verbose=2,
        refit=True
    ))
])
pipeline_rf.fit(X_train, y_train)
How should I go about extracting the best set of parameters?
You first need to get the GridSearchCV object from the pipeline, and then call best_params_ on it. This can be done by:
pipeline_rf.named_steps['grid_search_lr'].best_params_
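A design note: the more common arrangement is the inverse, with the Pipeline inside GridSearchCV, so the scaler is re-fit on each CV training fold rather than once on all of X_train. A sketch of that layout, using the rf__-prefixed parameter names shown earlier:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([('standardize', MinMaxScaler()),
                 ('rf', RandomForestClassifier())])

grid = GridSearchCV(pipe,
                    param_grid={'rf__max_depth': [50, 100, 150, 200],
                                'rf__n_estimators': [100, 200, 500]},
                    cv=10, scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)
grid.best_params_  # no named_steps digging required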
