I have a classification problem for which I am trying to build an ensemble of two classifiers, say, for example, KNeighbors and Random Forest. In addition, I want to implement it using a Pipeline. This is my attempt at the problem:
steps = [('scaler', StandardScaler()),
('regressor', VotingClassifier(estimators=[
('knn', KNeighborsClassifier()),
('clf', RandomForestClassifier())],voting='soft'))]
pipeline = Pipeline(steps)
parameters = [{'knn__n_neighbors': np.arange(1, 50)}, {
'clf__n_estimators': [10, 20, 30],
'clf__criterion': ['gini', 'entropy'],
'clf__max_features': [5, 10, 15],
'clf__max_depth': ['auto', 'log2', 'sqrt', None]}]
X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(),
test_size=0.3, random_state=65)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
On running this, the following error pops up:
Invalid parameter knn for estimator
Pipeline(steps=[('scaler', StandardScaler()),
('regressor', VotingClassifier(
estimators=[('knn', KNeighborsClassifier()),
('clf', RandomForestClassifier())
]
)
)
]
).
Check the list of available parameters with `estimator.get_params().keys()`.
I believe there is some error in how I have defined the parameter grid. Please help me out with this.
Since it's nested, you'll need to specify both prefixes, like this:
parameters = [{'regressor__knn__n_neighbors': np.arange(1, 5),  # merged into a single grid (you'd probably want one grid rather than two)
'regressor__clf__n_estimators': [10, 20, 30],
'regressor__clf__criterion': ['gini', 'entropy'],
'regressor__clf__max_depth': [5, 10, 15],
'regressor__clf__max_features': ['log2', 'sqrt', None]}]
Also, your max_depth and max_features values somehow switched places; I fixed that as well. (And 'auto' does the same as 'sqrt', at least in recent versions.)
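If it helps, here is a minimal sketch of the corrected setup, reusing the X_train/y_train split from your snippet; as the error message suggests, pipeline.get_params().keys() lists every tunable name with its full nested prefix:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', VotingClassifier(estimators=[
        ('knn', KNeighborsClassifier()),
        ('clf', RandomForestClassifier())], voting='soft'))])

# Every tunable parameter appears here with its pipeline + voting prefix
print(sorted(pipeline.get_params().keys()))

parameters = {'regressor__knn__n_neighbors': np.arange(1, 50),
              'regressor__clf__n_estimators': [10, 20, 30]}

cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)  # X_train, y_train as produced by your train_test_split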
Below is my code; I ran it twice. The first run used "criterion": ['gini', 'entropy'] and the second used just 'entropy' ('gini' was removed); nothing else changed. I expected that with fewer combinations the score would be equal or lower, but it was higher. How is that possible? There is no randomness, and those numbers repeat every time.
Using "criterion": ['gini', 'entropy'] got a score of 0.850
Using "criterion": ['entropy'] got a score of 0.871 (higher)
X_train, X_test, y_train, y_test = train_test_split(dataX, dataY, test_size=0.2, random_state=10, stratify=dataY)
gs_params = {
"criterion": ['gini', 'entropy'],
"max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
"min_samples_split": range(2, 9, 2),
"min_samples_leaf": range(1, 5)
}
gs = GridSearchCV(model, param_grid=gs_params, n_jobs=-1, verbose=1,
cv=5, scoring='f1_weighted', refit=True)
clf = Pipeline([
('scaler', StandardScaler()),
('features1', SelectFromModel(RandomForestClassifier(random_state=80), threshold='median')),
('features2', SelectFromModel(RandomForestClassifier(random_state=81), threshold='median')),
('features3', SelectFromModel(RandomForestClassifier(random_state=82), threshold='median')),
('gs', gs)
])
clf.fit(X_train.values, y_train.values)
test_score_opt = clf.score(X_test.values, y_test.values)
Variable names for my training and test data are X_train, X_test, Y_train, Y_test.
I have run a GridSearchCV instance from sklearn to do hyperparameter tuning for my random forest model.
param_grid = {
'n_estimators': [500],
'max_features': ['sqrt', None],
'max_depth': [ 6 ],
'max_leaf_nodes': [8],
'min_impurity_decrease':[0,0.02],
'min_samples_split':[2]
}
grid_search= GridSearchCV(RandomForestClassifier(
criterion='gini',
min_weight_fraction_leaf=0.0,
bootstrap=True,
n_jobs=-1,
random_state=1, verbose=0,
warm_start=False, class_weight='balanced',
ccp_alpha=0.0,
max_samples=None),
param_grid=param_grid,verbose=50,cv=2,n_jobs=-1,scoring='balanced_accuracy')
grid_search.fit(X_train, Y_train)
All the scores that I can see while the grid search is training are in the range of 0.4 to 0.6.
Following is the output of the best score:
[CV 2/2; 1/4] END max_depth=6, max_features=sqrt, max_leaf_nodes=8, min_impurity_decrease=0, min_samples_split=2, n_estimators=500;, score=0.552 total time= 15.4s
My question is: when I manually calculate balanced_accuracy using
from sklearn.metrics import balanced_accuracy_score, by running print('training accuracy', balanced_accuracy_score(grid_search.predict(X_train), Y_train, adjusted=False)), I get a value of about 0.96, which is very different from what GridSearchCV shows during the run. Why is this so, and what does the score in GridSearchCV mean then? Note that I passed scoring='balanced_accuracy' to GridSearchCV to make sure they calculate the same thing.
The score you get from GridSearchCV is the validation score (the score measured on the part of X_train that was held out from training in each fold).
Your manually calculated score is the training score (you fit the model and evaluate it on the same data, X_train).
The high difference is a sign of overfitting.
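If you want to see that gap directly, GridSearchCV can record the score on the training folds as well; a minimal sketch, reusing param_grid, X_train and Y_train from your question (the estimator settings are abbreviated here for illustration):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(RandomForestClassifier(class_weight='balanced', random_state=1),
                           param_grid=param_grid, cv=2, n_jobs=-1,
                           scoring='balanced_accuracy',
                           return_train_score=True)  # also score each candidate on its training folds
grid_search.fit(X_train, Y_train)

results = pd.DataFrame(grid_search.cv_results_)
# A large gap between mean_train_score and mean_test_score is the overfitting signal
print(results[['params', 'mean_train_score', 'mean_test_score']])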
You can try to change param_grid:
param_grid = {
'n_estimators': [100, 200], # 500 seems high and might take too long for no reason
'max_features': ['sqrt','log2', None], # Less features can reduce overfitting
'max_depth': [3, 4, 5, 6 ], # Lower depth can reduce overfitting
'max_leaf_nodes': [4, 6, 8], # Lower max_leaf_nodes can reduce overfitting
'min_impurity_decrease':[0,0.02],
'min_samples_split':[2],
'min_samples_leaf': [5, 10, 20] # Higher values can reduce overfitting
}
Also using cv=3 or cv=5 in GridSearchCV could help.
See this post about solving Random Forest overfitting.
I am trying to apply RandomizedSearchCV to a RegressorChain XGBoost model, but I get an error: Invalid parameter learning_rate for estimator RegressorChain(base_estimator=XGBRegressor.
If I comment out all the values in the grid dict it works; otherwise it doesn't accept any param.
The same models (XGBRegressor and RegressorChain) work fine on their own. RandomizedSearchCV is not accepting the params in the grid dict.
# Setup the parameters grid
grid = {
'n_estimators': [100, 500, 1000],
'max_depth': [5, 10, 20, 30],
'max_features': ["auto", "sqrt"],
'eta': [0.09, 0.1, 0.2],
'booster': ["dart", "gblinear"]
}
clf = XGBRegressor(objective='reg:squarederror')
chain = RegressorChain(base_estimator=clf, order=[0, 1, 2, 3, 4,5])
# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=chain,
param_distributions=grid,
n_iter=10, # number of models to try
cv=5,
verbose=1,
random_state=42,
refit=True)
# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train) # 'rs' is short
Since the XGBRegressor is the base_estimator of RegressorChain, the parameters of XGBRegressor become nested and must be addressed with base_estimator__xxx:
grid = {
'base_estimator__n_estimators': [100, 500, 1000],
'base_estimator__max_depth': [5, 10, 20, 30],
'base_estimator__max_features': ["auto", "sqrt"],
'base_estimator__eta': [0.09, 0.1, 0.2],
'base_estimator__booster': ["dart", "gblinear"]
}
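If in doubt about the exact prefix, you can print the parameter names the wrapper actually exposes before building the search; a quick sketch, reusing the chain object and the prefixed grid above (imports as in your original code):

# Every tunable XGBRegressor parameter appears here with the base_estimator__ prefix
print([name for name in chain.get_params().keys()
       if name.startswith('base_estimator__')])

rs_clf = RandomizedSearchCV(estimator=chain,
                            param_distributions=grid,  # the prefixed grid above
                            n_iter=10, cv=5, verbose=1,
                            random_state=42, refit=True)
rs_clf.fit(X_train, y_train)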
I have multivariate time series data and want to detect anomalies with the Isolation Forest algorithm.
I want to get the best parameters from GridSearchCV; here is the code snippet of the GridSearchCV.
The input data set is loaded with the snippet below.
df = pd.read_csv("train.csv")
df.drop(['dataTimestamp','Anomaly'], inplace=True, axis=1)
X_train = df
y_train = df1[['Anomaly']]  # the Anomaly column is the labelled data
Define the parameters for the Isolation Forest:
clf = IsolationForest(random_state=47, behaviour='new', score="accuracy")
param_grid = {'n_estimators': list(range(100, 800, 5)), 'max_samples': list(range(100, 500, 5)), 'contamination': [0.1, 0.2, 0.3, 0.4, 0.5], 'max_features': [5,10,15], 'bootstrap': [True, False], 'n_jobs': [5, 10, 20, 30]}
f1sc = make_scorer(f1_score)
grid_dt_estimator = model_selection.GridSearchCV(clf, param_grid,scoring=f1sc, refit=True,cv=10, return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)
After executing the fit, I got the error below.
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
Can someone guide me on what this is about? I tried average='weight', but still no luck; am I doing anything wrong here?
Please also let me know how to get the F-score.
You get this error because you didn't set the average parameter when turning f1_score into a scorer. As detailed in the documentation:
average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned.
The consequence is that the scorer returns one score per class of your classification problem instead of a single measure. The solution is to pick one of the possible values of the average parameter for f1_score, depending on your needs. I have therefore refactored the code you provided as an example, to offer a possible solution to your problem:
from sklearn.ensemble import IsolationForest
from sklearn.metrics import make_scorer, f1_score
from sklearn import model_selection
from sklearn.datasets import make_classification
X_train, y_train = make_classification(n_samples=500,
n_classes=2)
clf = IsolationForest(random_state=47, behaviour='new')
param_grid = {'n_estimators': list(range(100, 800, 5)),
'max_samples': list(range(100, 500, 5)),
'contamination': [0.1, 0.2, 0.3, 0.4, 0.5],
'max_features': [5,10,15],
'bootstrap': [True, False],
'n_jobs': [5, 10, 20, 30]}
f1sc = make_scorer(f1_score, average='micro')
grid_dt_estimator = model_selection.GridSearchCV(clf,
param_grid,
scoring=f1sc,
refit=True,
cv=10,
return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)
Note the form of the make_scorer call: it has to be
make_scorer(f1_score, average='micro'). Writing make_scorer(f1_score(average='micro')) calls the metric immediately, without arguments, instead of wrapping it, and fails.
Not all of the parameters you are tuning are necessary.
For example:
contamination is the expected rate of anomalies; you can determine the best value after you have fitted a model by tuning the threshold on model.score_samples (see the sketch after this list).
n_jobs is the number of CPU cores used; it affects speed, not model quality, so there is no point searching over it.
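For example, a rough sketch of the contamination point, using the X_train frame from your question; n_estimators=300 and the 0.05 cut-off are placeholder choices, not recommendations:

import numpy as np
from sklearn.ensemble import IsolationForest

model = IsolationForest(n_estimators=300, random_state=47).fit(X_train)

# Lower score_samples values mean "more anomalous"
scores = model.score_samples(X_train)
threshold = np.quantile(scores, 0.05)  # flag the lowest-scoring 5% as anomalies
predicted_anomaly = scores < threshold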
I am trying to use scikit-learn GridSearchCV together with XGBoost XGBClassifier wrapper for my unbalanced multi-class classification problem. So far I have used a list of class weights as an input for the scale_pos_weight argument, but this does not seem to work as all my predictions are for the majority class. This is probably because in the documentation of the XGBClassifier it is mentioned that scale_pos_weight can only be used for binary classification problems.
So my question is, how can I input sample/class weights for a multi-class classification task using scikit-learn GridSearchCV?
My code is below:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight('balanced', np.unique(training_targets),
training_targets[target_label[0]])
random_state = np.random.randint(0, 1000)
parameters = {
'max_depth': [3, 4, 5],
'learning_rate': [0.1, 0.2, 0.3],
'n_estimators': [50, 100, 150],
'gamma': [0, 0.1, 0.2],
'min_child_weight': [0, 0.5, 1],
'max_delta_step': [0],
'subsample': [0.7, 0.8, 0.9, 1],
'colsample_bytree': [0.6, 0.8, 1],
'colsample_bylevel': [1],
'reg_alpha': [0, 1e-2, 1, 1e1],
'reg_lambda': [0, 1e-2, 1, 1e1],
'base_score': [0.5]
}
xgb_model = xgb.XGBClassifier(scale_pos_weight = class_weights, silent = True,
random_state = random_state)
clf = GridSearchCV(xgb_model, parameters, scoring = 'f1_micro', n_jobs = -1, cv = 5)
clf.fit(training_features, training_targets.values[:, 0])
model = clf.best_estimator_
scale_pos_weight is only for binary classification, so it won't work on multi-class classification tasks.
For your case, it is more advisable to use the weight parameter as described here (https://xgboost.readthedocs.io/en/latest/python/python_api.html). The argument is an array in which each element represents the weight assigned to the corresponding data point.
The idea is essentially to manually assign different weights to different classes. There is no standard for how you assign the weights; it is up to you. The more weight a sample is assigned, the more it affects the objective function during training.
However, if you use the scikit-learn API, you cannot specify the weight parameter or use the DMatrix format. Thankfully, xgboost has its own cross-validation function, the details of which you can find here: https://xgboost.readthedocs.io/en/latest/python/python_api.html
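For illustration, a hedged sketch of that native-API route, attaching per-sample weights to a DMatrix and running xgboost's own cross-validation; training_features and training_targets are the names from your question, while majority_class, n_classes and the 1.0/4.0 weights are placeholders you would replace with values from your data:

import numpy as np
import xgboost as xgb

y = training_targets.values[:, 0]
# Placeholder weighting: up-weight every sample that is not in the majority class
sample_weights = np.where(y == majority_class, 1.0, 4.0)

dtrain = xgb.DMatrix(training_features, label=y, weight=sample_weights)

params = {'objective': 'multi:softprob', 'num_class': n_classes,
          'max_depth': 4, 'eta': 0.1}
cv_results = xgb.cv(params, dtrain, num_boost_round=100, nfold=5,
                    metrics='mlogloss', seed=42)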
I suggest that you use the compute_sample_weight() function and set weights for each sample by looking at your labels. This will solve your problem in the most elegant way. See below for 3 classes (-1,0,1):
sample_weights = compute_sample_weight({-1: 4, 0: 1, 1: 4}, Train_Labels)
random_search = RandomizedSearchCV(model, param_distributions=params, n_iter=param_comb,
                                   return_train_score=True, scoring=score, cv=ps,
                                   n_jobs=-1, verbose=3, random_state=1001)
random_search.fit(Train, Train_Labels, sample_weight=sample_weights)
In a multi-class setup we need to pass the sample_weight parameter with a list of values (weights) matching the number of data points (for example, the number of rows in X_train) to fit() of XGBClassifier. Check the docs.
While using XGBClassifier with scikit-learn GridSearchCV, you can pass sample_weight directly to the fit() of GridSearchCV.
Note: Tried in scikit-learn version 1.1.1. Not sure from which version onwards this is supported.
For example:
def get_weights(cls):
class_weights = {
# class-labels based on your dataset.
0: 1,
1: 4,
2: 1,
}
return [class_weights[cl] for cl in cls]
grid = {
"max_depth": [3, 4, 5, 6],
"n_estimators": range(20, 70, 10),
"learning_rate": np.arange(0.25, 0.50, 0.05),
}
xgb_clf = XGBClassifier(random_state=42, n_jobs=-1)
xgb_cvm = GridSearchCV(estimator=xgb_clf, param_grid=grid, n_jobs=-1, cv=5)
xgb_cvm.fit(X, y, sample_weight=get_weights(y))