Variable names for my training and test data are X_train, X_test, Y_train, and Y_test. I have run a GridSearchCV instance from sklearn to do hyperparameter tuning for my random forest model:
param_grid = {
    'n_estimators': [500],
    'max_features': ['sqrt', None],
    'max_depth': [6],
    'max_leaf_nodes': [8],
    'min_impurity_decrease': [0, 0.02],
    'min_samples_split': [2]
}
grid_search = GridSearchCV(
    RandomForestClassifier(
        criterion='gini',
        min_weight_fraction_leaf=0.0,
        bootstrap=True,
        n_jobs=-1,
        random_state=1,
        verbose=0,
        warm_start=False,
        class_weight='balanced',
        ccp_alpha=0.0,
        max_samples=None),
    param_grid=param_grid, verbose=50, cv=2, n_jobs=-1,
    scoring='balanced_accuracy')
grid_search.fit(X_train, Y_train)
All the scores I can see while the grid search is running are in the range 0.4-0.6.
Here is the output line with the best score:
[CV 2/2; 1/4] END max_depth=6, max_features=sqrt, max_leaf_nodes=8, min_impurity_decrease=0, min_samples_split=2, n_estimators=500;, score=0.552 total time= 15.4s
My question is: when I manually calculate balanced accuracy using
from sklearn.metrics import balanced_accuracy_score, by running print('training accuracy', balanced_accuracy_score(Y_train, grid_search.predict(X_train), adjusted=False)), I get a value of about 0.96, which is very different from what GridSearchCV shows during the run. Why is this so? And what does the score in GridSearchCV mean then? Please note that I passed scoring='balanced_accuracy' to GridSearchCV to make sure they calculate the same thing.
The score you get from GridSearchCV is the validation score, measured on the part of X_train not used to train the model in each fold.
Your manually calculated score is the training score: you fit the model and evaluate it on the same data, X_train.
The large difference between the two is a sign of overfitting.
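To see both numbers from a single run, you can ask GridSearchCV to record training-fold scores too. A minimal sketch, reusing the param_grid, X_train and Y_train from the question:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# return_train_score=True makes cv_results_ also contain the scores
# measured on the training folds, next to the validation-fold scores
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=1, class_weight='balanced'),
    param_grid=param_grid,
    cv=2,
    scoring='balanced_accuracy',
    return_train_score=True,
    n_jobs=-1)
grid_search.fit(X_train, Y_train)

# A large gap between these two arrays is the overfitting signal
print('mean validation score:', grid_search.cv_results_['mean_test_score'])
print('mean training score:  ', grid_search.cv_results_['mean_train_score'])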
You can try changing param_grid:
param_grid = {
    'n_estimators': [100, 200],              # 500 seems high and might take too long for no reason
    'max_features': ['sqrt', 'log2', None],  # fewer features can reduce overfitting
    'max_depth': [3, 4, 5, 6],               # lower depth can reduce overfitting
    'max_leaf_nodes': [4, 6, 8],             # lower max_leaf_nodes can reduce overfitting
    'min_impurity_decrease': [0, 0.02],
    'min_samples_split': [2],
    'min_samples_leaf': [5, 10, 20]          # higher values can reduce overfitting
}
Also using cv=3 or cv=5 in GridSearchCV could help.
See this post about solving Random Forest overfitting.
I have a multiclass classification problem. In the cross-validated grid search to find the best hyperparameter settings, I found that the random forest severely under-performs (accuracy = 0.412, while other ML algorithms reach 0.70 or higher). I understand this is not necessarily a red flag, as different ML algorithms may perform best or worst in different regions of the problem space, but I am wondering whether I am setting the ranges of its hyperparameters incorrectly.
ml_algo_param_dict = \
{
'LR_V1': { 'clf': LogisticRegression(),
'param': {
'logisticregression__solver': ['liblinear'],
'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C': np.logspace(-4, 4, 20),
'logisticregression__tol': np.logspace(-5, 5, 20),
'logisticregression__class_weight': [None, 'balanced'],
'logisticregression__multi_class': ['ovr', 'auto'],
'logisticregression__max_iter': [4000, 20000],
}},
'LR_V2': { 'clf': LogisticRegression(),
'param': {
'logisticregression__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
'logisticregression__penalty': ['none', 'l2'],
'logisticregression__C': np.logspace(-4, 4, 20),
'logisticregression__tol': np.logspace(-5, 5, 20),
'logisticregression__class_weight': [None, 'balanced'],
'logisticregression__multi_class': ['ovr', 'multinomial', 'auto'],
'logisticregression__max_iter': [4000, 20000],
}},
'SVC': { 'clf': OneVsRestClassifier(LinearSVC()),
'param': {
'onevsrestclassifier__estimator__penalty': ['l2'],
'onevsrestclassifier__estimator__loss': ['hinge', 'squared_hinge'],
'onevsrestclassifier__estimator__C': np.logspace(-4, 4, 20),
'onevsrestclassifier__estimator__tol': np.logspace(-5, 5, 20),
'onevsrestclassifier__estimator__class_weight': [None, 'balanced'],
'onevsrestclassifier__estimator__multi_class': ['ovr', 'crammer_singer'],
'onevsrestclassifier__estimator__max_iter': [50, 1000, 4000, 20000],
}},
'RF': {'clf': RandomForestClassifier(),
'param': {
'randomforestclassifier__n_estimators': [1, 2, 4, 8, 16, 32, 64, 100, 200, 500, ],
'randomforestclassifier__criterion': ['gini', 'entropy'],
'randomforestclassifier__class_weight': [None, 'balanced', 'balanced_subsample'],
'randomforestclassifier__max_depth': np.linspace(1, 10, 32, endpoint=True),
'randomforestclassifier__min_samples_split': np.linspace(0.1, 1.0, 10, endpoint=True),
'randomforestclassifier__min_samples_leaf': np.linspace(0.1, 0.5, 5, endpoint=True),
'randomforestclassifier__max_leaf_nodes': [None, 50, 100, 200, 400],
'randomforestclassifier__max_features': [None, 'auto', 'sqrt', 'log2'],
}},
'NB': {'clf': BernoulliNB(),
'param': {
'bernoullinb__alpha': np.logspace(-4, 4, 20),
'bernoullinb__binarize': [None, 0, .2, .4, .6, .8, 1],
'bernoullinb__fit_prior': [True, False],
}},
}
Result
>> Best score: 0.712
>> Best parameter:
Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with...
verbose=False),
['LOC_ENTITY_LIST'])],
verbose=False)),
('logisticregression',
LogisticRegression(C=0.03359818286283781,
class_weight='balanced', dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=4000,
multi_class='ovr', n_jobs=None,
penalty='l2', random_state=None,
solver='liblinear', tol=1e-05, verbose=0,
warm_start=False))],
verbose=False)
>> Best selected parameter:
{'logisticregression__tol': 1e-05, 'logisticregression__solver': 'liblinear', 'logisticregression__penalty': 'l2', 'logisticregression__multi_class': 'ovr', 'logisticregression__max_iter': 4000, 'logisticregression__class_weight': 'balanced', 'logisticregression__C': 0.03359818286283781}
>> Best score: 0.738
>> Best parameter:
Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with...
verbose=False),
['LOC_ENTITY_LIST'])],
verbose=False)),
('logisticregression',
LogisticRegression(C=0.23357214690901212, class_weight=None,
dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None,
max_iter=20000, multi_class='ovr',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs',
tol=0.01438449888287663, verbose=0,
warm_start=False))],
verbose=False)
>> Best selected parameter:
{'logisticregression__tol': 0.01438449888287663, 'logisticregression__solver': 'lbfgs', 'logisticregression__penalty': 'l2', 'logisticregression__multi_class': 'ovr', 'logisticregression__max_iter': 20000, 'logisticregression__class_weight': None, 'logisticregression__C': 0.23357214690901212}
>> Best score: 0.708
>> Best parameter:
Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with...
verbose=False),
['LOC_ENTITY_LIST'])],
verbose=False)),
('onevsrestclassifier',
OneVsRestClassifier(estimator=LinearSVC(C=78.47599703514607,
class_weight='balanced',
dual=True,
fit_intercept=True,
intercept_scaling=1,
loss='hinge',
max_iter=4000,
multi_class='ovr',
penalty='l2',
random_state=None,
tol=3.359818286283781e-05,
verbose=0),
n_jobs=None))],
verbose=False)
>> Best selected parameter:
{'onevsrestclassifier__estimator__tol': 3.359818286283781e-05, 'onevsrestclassifier__estimator__penalty': 'l2', 'onevsrestclassifier__estimator__multi_class': 'ovr', 'onevsrestclassifier__estimator__max_iter': 4000, 'onevsrestclassifier__estimator__loss': 'hinge', 'onevsrestclassifier__estimator__class_weight': 'balanced', 'onevsrestclassifier__estimator__C': 78.47599703514607}
>> Best score: 0.412
>> Best parameter:
Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with...
RandomForestClassifier(bootstrap=True, class_weight=None,
criterion='gini',
max_depth=6.806451612903226,
max_features=None, max_leaf_nodes=50,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=0.2,
min_samples_split=0.2,
min_weight_fraction_leaf=0.0,
n_estimators=200, n_jobs=None,
oob_score=False, random_state=None,
verbose=0, warm_start=False))],
verbose=False)
>> Best selected parameter:
{'randomforestclassifier__n_estimators': 200, 'randomforestclassifier__min_samples_split': 0.2, 'randomforestclassifier__min_samples_leaf': 0.2, 'randomforestclassifier__max_leaf_nodes': 50, 'randomforestclassifier__max_features': None, 'randomforestclassifier__max_depth': 6.806451612903226, 'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__class_weight': None}
>> Best score: 0.697
>> Best parameter:
Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with...
lowercase=True,
max_df=1.0,
max_features=5000,
min_df=1,
ngram_range=(1,
1),
preprocessor=None,
stop_words=None,
strip_accents=None,
token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None,
vocabulary=None))],
verbose=False),
['LOC_ENTITY_LIST'])],
verbose=False)),
('bernoullinb',
BernoulliNB(alpha=0.00026366508987303583, binarize=0.6,
class_prior=None, fit_prior=True))],
verbose=False)
>> Best selected parameter:
{'bernoullinb__fit_prior': True, 'bernoullinb__binarize': 0.6, 'bernoullinb__alpha': 0.00026366508987303583}
Any suggestions on what to do or test next, and an explanation of why this is the case, would be most appreciated.
In machine learning and deep learning, whether for discriminative or generative models, it is always worth remembering Occam's razor: "when presented with competing hypothetical answers to a problem, one should select the one that makes the fewest assumptions". Translated to our setting:
Models with lower complexity (fewer hyperparameters, features, etc.) tend to perform better than models with higher complexity.
It is generally recommended to start with the least complex model and then move step by step toward more elaborate ones (more hyperparameters, more assumptions, etc.).
A Random Forest is already a very complex model: a bagging ensemble of many decision trees.
My recommendation would be to first try a very simple decision tree and tune its hyperparameters. If you obtain performance better than or similar to your other models, then add complexity gradually, starting with a very simple RF with minimal hyperparameters, e.g. 2 < randomforestclassifier__max_leaf_nodes < 10. A minimal sketch of that starting point follows.
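The parameter ranges below are illustrative assumptions rather than tuned values, and X_train / Y_train stand in for your own (already preprocessed) training data:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# A single, heavily constrained tree as the least complex baseline
tree_param = {
    'max_depth': [2, 3, 4, 5],
    'max_leaf_nodes': [4, 6, 8, 10],
    'class_weight': [None, 'balanced'],
}
tree_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           param_grid=tree_param, cv=5)
tree_search.fit(X_train, Y_train)
print(tree_search.best_score_, tree_search.best_params_)
If this simple tree already comes close to the 0.70 scores above, a small forest over similarly constrained trees is the natural next step.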
from mlxtend.regressor import StackingRegressor
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.ensemble import GradientBoostingRegressor as GBR
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import numpy as np
import xgboost as xgb
rfr = RFR(n_estimators=500, n_jobs=cc.ncpu, random_state=0)
gbr = GBR(n_estimators=1000, random_state=0)
xgr = xgb.XGBRegressor()
mtr = RFR() # meta regressor
regressors = [rfr, gbr, xgr]
model = StackingRegressor(regressors=regressors, meta_regressor=mtr)
param_grid = {
'fs__threshold': ['median'],
'fs__estimator__max_features': ['log2'],
'clf__rfr__max_features': ['auto', 'log2'],
'clf__gbr__learning_rate': [0.05, 0.02, 0.01],
'clf__gbr__max_depth': [4, 5, 6, 7],
'clf__gbr__max_features': ['auto', 'log2'],
'clf__gbr__n_estimators': [500, 1000, 2000],
'clf__xgr__learning_rate': [0.001, 0.05, 0.1, 0.2],
'clf__xgr__max_depth': [2, 4, 6],
'clf__xgr__min_child_weight': [1, 3, 5],
'clf__xgr__n_estimators': [500, 1000],
'clf__meta-mtr__n_estimators': [750, 1500]
}
rf_feature_imp = RFR(250, n_jobs=cc.ncpu)
feat_selection = SelectFromModel(rf_feature_imp)
pipeline = Pipeline([('fs', feat_selection), ('clf', model), ])
gs = GridSearchCV(pipeline, param_grid=param_grid, verbose=1, n_jobs=-1, error_score=np.nan)
In the code above, I want to use the mlxtend stacking regressor, and also use a random forest to select relevant features. However, this code does not work and I get the following error:
ValueError: Invalid parameter xgr for estimator StackingRegressor(meta_regressor=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False),
regressors=[RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=500, n_jobs=5, oob_sc...eg:linear', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True, subsample=1)],
verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
How can I fix this?
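The error message's own hint points at the cause: rfr, gbr and xgr are just local Python variable names, which StackingRegressor never sees, so the clf__rfr__... keys in param_grid cannot match any of its parameters. A minimal sketch to list the names the pipeline actually exposes, using the pipeline object defined above:
# Every key in param_grid must match one of these names exactly
for name in sorted(pipeline.get_params().keys()):
    print(name)
In mlxtend versions of that era, the sub-regressors typically appear under their lowercased class names and the meta regressor under a meta- prefix, so keys like clf__randomforestregressor__max_features, clf__gradientboostingregressor__learning_rate, clf__xgbregressor__max_depth and clf__meta-randomforestregressor__n_estimators are plausible replacements; the printed list is authoritative for your installed version.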
I am trying to train an SVM in scikit-learn. I am following the example and tried to adjust it to my 3-d feature vectors.
I tried the example from the page http://scikit-learn.org/stable/modules/svm.html and it ran through. While bugfixing I came back to the tutorial setup and found the following:
from sklearn import svm

X = [[0, 0], [1, 1], [2, 2]]
y = [0, 1, 1]
clf = svm.SVC()
clf.fit(X, y)
works while
X = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
y = [0, 1, 1]
clf = svm.SVC()
clf.fit(X, y)
fails with:
ValueError: X.shape[1] = 2 should be equal to 3, the number of features at training time
What is wrong here? It's only one additional dimension...
Running your second snippet works for me:
>>> X = [[0,0,0], [1,1,1], [2,2,2]]
>>> y = [0,1,1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, shrinking=True, tol=0.001,
verbose=False)
That error message seems like it should actually happen when you're calling .predict() on an SVM object with kernel="precomputed". Is that the case?
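One way to reproduce the exact wording from the question is a feature-count mismatch at .predict() time, e.g. a clf fitted earlier on the 3-feature X and then handed the old 2-feature points. A minimal sketch (recent scikit-learn versions word the error slightly differently):
from sklearn import svm

X = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]   # three features at fit time
y = [0, 1, 1]
clf = svm.SVC()
clf.fit(X, y)

# Raises ValueError: X.shape[1] = 2 should be equal to 3, the number of
# features at training time
clf.predict([[0, 0]])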