I am trying to apply RandomizedSearchCV to a RegressorChain that wraps an XGBoost model, but I get an error: Invalid parameter learning_rate for estimator RegressorChain(base_estimator=XGBRegressor.
If I comment out all the values in the grid dict, it works; otherwise it doesn't accept any parameter.
The same models (XGBRegressor and RegressorChain) work fine on their own. RandomizedSearchCV is not accepting the params in the grid dict.
# Setup the parameters grid
grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [5, 10, 20, 30],
    'max_features': ["auto", "sqrt"],
    'eta': [0.09, 0.1, 0.2],
    'booster': ["dart", "gblinear"]
}
clf = XGBRegressor(objective='reg:squarederror')
chain = RegressorChain(base_estimator=clf, order=[0, 1, 2, 3, 4, 5])
# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=chain,
                            param_distributions=grid,
                            n_iter=10,  # number of models to try
                            cv=5,
                            verbose=1,
                            random_state=42,
                            refit=True)
# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train) # 'rs' is short
Since XGBRegressor is the base_estimator of RegressorChain, its parameters are nested and must be addressed as base_estimator__xxx (note the double underscore):
grid = {
    'base_estimator__n_estimators': [100, 500, 1000],
    'base_estimator__max_depth': [5, 10, 20, 30],
    'base_estimator__max_features': ["auto", "sqrt"],
    'base_estimator__eta': [0.09, 0.1, 0.2],
    'base_estimator__booster': ["dart", "gblinear"]
}
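A quick way to check which names a search will accept is to list the keys of get_params() on the wrapped model. The sketch below is only an illustration under a few assumptions: it swaps in scikit-learn's GradientBoostingRegressor for XGBRegressor so that it runs without xgboost, it uses synthetic data in place of X_train and y_train, and it reads the nesting prefix from get_params() instead of hard-coding it (in the setup above the prefix is base_estimator).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multioutput import RegressorChain

# Synthetic two-target regression data (assumption: stands in for X_train, y_train).
X, y = make_regression(n_samples=60, n_features=5, n_targets=2, random_state=0)

# Stand-in for XGBRegressor so the sketch needs only scikit-learn.
chain = RegressorChain(GradientBoostingRegressor(random_state=0), order=[0, 1])

# Nested parameters appear as "<wrapper argument>__<parameter>".
nested_names = sorted(name for name in chain.get_params() if "__" in name)
prefix = nested_names[0].split("__")[0]
print(prefix)

# Build the grid with the discovered prefix; the values are deliberately tiny.
grid = {
    f"{prefix}__n_estimators": [5, 10],
    f"{prefix}__max_depth": [2, 3],
}
search = RandomizedSearchCV(chain, param_distributions=grid,
                            n_iter=2, cv=2, random_state=42)
search.fit(X, y)
print(search.best_params_)
```

Printing the keys first is a cheap sanity check: any name that is not in chain.get_params() triggers the "Invalid parameter" error from the question.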
I have multivariate time series data and want to detect anomalies with the Isolation Forest algorithm.
I want to get the best parameters from GridSearchCV; here is the code snippet for GridSearchCV.
The input data set is loaded with the snippet below.
df = pd.read_csv("train.csv")
df.drop(['dataTimestamp','Anomaly'], inplace=True, axis=1)
X_train = df
y_train = df1[['Anomaly']]  # the Anomaly column holds the labels
Define the parameters for Isolation Forest.
clf = IsolationForest(random_state=47, behaviour='new', score="accuracy")
param_grid = {'n_estimators': list(range(100, 800, 5)), 'max_samples': list(range(100, 500, 5)), 'contamination': [0.1, 0.2, 0.3, 0.4, 0.5], 'max_features': [5,10,15], 'bootstrap': [True, False], 'n_jobs': [5, 10, 20, 30]}
f1sc = make_scorer(f1_score)
grid_dt_estimator = model_selection.GridSearchCV(clf, param_grid,scoring=f1sc, refit=True,cv=10, return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)
After executing fit, I got the error below.
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
Can someone explain what this is about? I tried average='weight', but still no luck. Am I doing anything wrong here?
Please also let me know how to get the F-score.
You get this error because you didn't set the average parameter when turning f1_score into a scorer. As detailed in the documentation:
average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’,
‘samples’, ‘weighted’] This parameter is required for
multiclass/multilabel targets. If None, the scores for each class are
returned.
With the default average='binary', f1_score can only handle a two-class target, so it raises this error as soon as it sees more than two distinct labels instead of returning a single measure. The solution is to pass one of the other values of the average parameter to f1_score, depending on your needs. I refactored the code you provided into a possible solution:
from sklearn.ensemble import IsolationForest
from sklearn.metrics import make_scorer, f1_score
from sklearn import model_selection
from sklearn.datasets import make_classification
X_train, y_train = make_classification(n_samples=500,
                                       n_classes=2)
# behaviour='new' is only accepted by older scikit-learn releases; drop it on newer ones
clf = IsolationForest(random_state=47, behaviour='new')
param_grid = {'n_estimators': list(range(100, 800, 5)),
              'max_samples': list(range(100, 500, 5)),
              'contamination': [0.1, 0.2, 0.3, 0.4, 0.5],
              'max_features': [5, 10, 15],
              'bootstrap': [True, False],
              'n_jobs': [5, 10, 20, 30]}
# Pass the metric itself plus its keyword arguments; do not call f1_score here
f1sc = make_scorer(f1_score, average='micro')
grid_dt_estimator = model_selection.GridSearchCV(clf,
                                                 param_grid,
                                                 scoring=f1sc,
                                                 refit=True,
                                                 cv=10,
                                                 return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)
Update make_scorer with this to get it working.
make_scorer(f1_score, average='micro')
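As a minimal end-to-end sketch of this scorer in use: the data here is synthetic and the grid is deliberately tiny, both assumptions made for illustration. The labels are encoded as 1 (normal) and -1 (anomaly) because that is what IsolationForest.predict returns; with 0/1 labels the predictions and the targets would not share an encoding.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data: 180 normal points and 20 far-away anomalies (assumption).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(180, 2)),
               rng.uniform(6, 8, size=(20, 2))])
# Same encoding as IsolationForest.predict: 1 = inlier, -1 = outlier.
y = np.r_[np.ones(180, dtype=int), -np.ones(20, dtype=int)]

# The metric and its keyword arguments are passed separately to make_scorer.
f1_micro = make_scorer(f1_score, average="micro")

# Shuffled, stratified folds so every fold contains both labels.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    IsolationForest(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_samples": [32, 64]},
    scoring=f1_micro,
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```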
Not all of the parameters you tune are necessary.
For example:
contamination is the expected rate of anomalies; you can determine the best value after fitting a model by tuning the threshold on model.score_samples.
n_jobs is the number of CPU cores to use; it does not change the fitted model.
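The threshold tuning mentioned above can be sketched as follows. This is an illustration under assumptions, not a recipe: the data is synthetic, the candidate rates are arbitrary, and the F1 score is computed on the same data the model was fitted on, whereas in practice a held-out set would be the safer choice.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

# Synthetic data: 180 normal points and 20 anomalies, labelled 1 and -1 (assumption).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(180, 2)),
               rng.uniform(6, 8, size=(20, 2))])
y = np.r_[np.ones(180, dtype=int), -np.ones(20, dtype=int)]

# Fit once; contamination is not part of any grid here.
model = IsolationForest(random_state=0).fit(X)
scores = model.score_samples(X)  # lower score = more abnormal

results = {}
for rate in (0.05, 0.10, 0.20):
    threshold = np.quantile(scores, rate)        # cut-off that flags roughly `rate` of the points
    pred = np.where(scores <= threshold, -1, 1)  # -1 = flagged as anomaly
    results[rate] = f1_score(y, pred, pos_label=-1)

best_rate = max(results, key=results.get)
print(results, best_rate)
```

Because the forest is fitted only once, trying more candidate rates costs almost nothing compared with adding contamination to the grid.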
I tried to use GridSearchCV in order to have the best parameters for my classifier. I am using the one-class SVM, and my code is:
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
                     'nu': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
                    {'kernel': ['linear'], 'nu': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]}
                    ]
scores = ['precision', 'recall']
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    clf = GridSearchCV(svm.OneClassSVM(), tuned_parameters,
                       scoring='%s_macro' % score)
    clf.fit(input_dataN)
I got this error:
TypeError: __call__() missing 1 required positional argument: 'y_true'
How can I fix it, please?
When you apply the fit method you need to supply your features (X_train) as well as your target class labels (y_train):
Fix this line so that it also passes the labels:
clf.fit(input_dataN)
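A minimal sketch of the fix, assuming labels are available: the precision_macro and recall_macro scorers compare predictions against y, so fit needs both arguments. The X and y arrays below are synthetic placeholders for input_dataN and its labels. They use 1 for inliers and -1 for outliers to match what OneClassSVM.predict returns, and nu is kept within (0, 1], the range OneClassSVM accepts.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import OneClassSVM

# Synthetic placeholders for the features and their labels (assumption).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(90, 2)),
               rng.uniform(5, 7, size=(15, 2))])
y = np.r_[np.ones(90, dtype=int), -np.ones(15, dtype=int)]

tuned_parameters = [
    {"kernel": ["rbf"], "gamma": [1e-2, 1e-1], "nu": [0.05, 0.1, 0.2]},
    {"kernel": ["linear"], "nu": [0.05, 0.1, 0.2]},
]
# Shuffled, stratified folds so every fold contains both labels.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

best = {}
for score in ["precision", "recall"]:
    clf = GridSearchCV(OneClassSVM(), tuned_parameters,
                       scoring="%s_macro" % score, cv=cv)
    clf.fit(X, y)  # y is ignored by OneClassSVM.fit but required by the scorer
    best[score] = clf.best_params_
print(best)
```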
I am trying to use scikit-learn GridSearchCV together with XGBoost XGBClassifier wrapper for my unbalanced multi-class classification problem. So far I have used a list of class weights as an input for the scale_pos_weight argument, but this does not seem to work as all my predictions are for the majority class. This is probably because in the documentation of the XGBClassifier it is mentioned that scale_pos_weight can only be used for binary classification problems.
So my question is, how can I input sample/class weights for a multi-class classification task using scikit-learn GridSearchCV?
My code is below:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight('balanced', np.unique(training_targets),
                                     training_targets[target_label[0]])
random_state = np.random.randint(0, 1000)
parameters = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.2, 0.3],
    'n_estimators': [50, 100, 150],
    'gamma': [0, 0.1, 0.2],
    'min_child_weight': [0, 0.5, 1],
    'max_delta_step': [0],
    'subsample': [0.7, 0.8, 0.9, 1],
    'colsample_bytree': [0.6, 0.8, 1],
    'colsample_bylevel': [1],
    'reg_alpha': [0, 1e-2, 1, 1e1],
    'reg_lambda': [0, 1e-2, 1, 1e1],
    'base_score': [0.5]
}
xgb_model = xgb.XGBClassifier(scale_pos_weight=class_weights, silent=True,
                              random_state=random_state)
clf = GridSearchCV(xgb_model, parameters, scoring = 'f1_micro', n_jobs = -1, cv = 5)
clf.fit(training_features, training_targets.values[:, 0])
model = clf.best_estimator_
scale_pos_weight is only for binary classification, so it won't work on multi-class classification tasks.
For your case, it's more advisable to use the weight parameter as described here (https://xgboost.readthedocs.io/en/latest/python/python_api.html). The argument is an array in which each element is the weight you assign to the corresponding data point.
The idea is essentially to manually assign different weights to different classes. There's no standard for how to assign the weights; it is up to you. The more weight a sample is assigned, the more it affects the objective function during training.
However, if you use the scikit-learn API format, you can neither specify the weight parameter nor use the DMatrix format. Thankfully, xgboost has its own cross-validation function, which is documented here: https://xgboost.readthedocs.io/en/latest/python/python_api.html
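The idea of giving each class its own weight and expanding that to one weight per data point can be sketched with scikit-learn's compute_class_weight. The label array below is hypothetical, and the 'balanced' heuristic is just one possible choice. Note that recent scikit-learn releases expect classes and y as keyword arguments, unlike the positional call in the question.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: six of class 0, three of class 1, one of class 2.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

classes = np.unique(y)
# 'balanced' gives each class the weight n_samples / (n_classes * class_count).
class_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weight_of = dict(zip(classes, class_weights))

# One weight per data point, looked up from the class of that point.
sample_weight = np.array([weight_of[label] for label in y])
print(weight_of)
print(sample_weight)
```

The resulting array has the same length as y, which is the shape a per-sample weight argument expects.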
I suggest that you use the compute_sample_weight() function and set weights for each sample by looking at your labels. This will solve your problem in the most elegant way. See below for 3 classes (-1,0,1):
sample_weights = compute_sample_weight({-1: 4, 0: 1, 1: 4}, Train_Labels)
random_search = RandomizedSearchCV(model, param_distributions=params, n_iter=param_comb,
                                   return_train_score=True, scoring=score, cv=ps,
                                   n_jobs=-1, verbose=3, random_state=1001)
random_search.fit(Train, Train_Labels, sample_weight=sample_weights)
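A self-contained variant of the snippet above, as a sketch under assumptions: GradientBoostingClassifier stands in for the XGBoost model so that only scikit-learn is needed, the data is synthetic, and the weights use the 'balanced' heuristic instead of a hand-written dictionary. The search forwards sample_weight to the estimator's fit for each training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced three-class problem (assumption).
X, y = make_classification(n_samples=150, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0)

# One weight per row; rarer classes receive larger weights.
sample_weights = compute_sample_weight("balanced", y)

model = GradientBoostingClassifier(random_state=0)  # stand-in for XGBClassifier
params = {"n_estimators": [10, 20], "max_depth": [2, 3]}
search = RandomizedSearchCV(model, param_distributions=params, n_iter=2,
                            scoring="f1_micro", cv=3, random_state=1001)

# Extra keyword arguments to fit are passed on to the estimator's fit.
search.fit(X, y, sample_weight=sample_weights)
print(search.best_params_)
```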
In a multi-class setup we need to pass the sample_weight parameter, a list of weights with one value per data point (for example, one per row of X_train), to the fit() of XGBClassifier. Check the docs.
When using XGBClassifier with scikit-learn GridSearchCV, you can pass sample_weight directly to the fit() of GridSearchCV.
Note: tried with scikit-learn version 1.1.1. I am not sure from which version onwards this is supported.
For example:
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

def get_weights(cls):
    class_weights = {
        # class-labels based on your dataset.
        0: 1,
        1: 4,
        2: 1,
    }
    # one weight per sample, looked up from its class label
    return [class_weights[cl] for cl in cls]

grid = {
    "max_depth": [3, 4, 5, 6],
    "n_estimators": range(20, 70, 10),
    "learning_rate": np.arange(0.25, 0.50, 0.05),
}
xgb_clf = XGBClassifier(random_state=42, n_jobs=-1)
xgb_cvm = GridSearchCV(estimator=xgb_clf, param_grid=grid, n_jobs=-1, cv=5)
xgb_cvm.fit(X, y, sample_weight=get_weights(y))
For example, I have a CNN which tries to predict digits from the MNIST dataset (code written using Keras). It has 10 outputs, which form a softmax layer. Only one of the outputs can be true (independently for each digit from 0 to 9):
Real: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
Predicted: [0.02, 0.9, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
The sum of the predicted values is equal to 1.0 by the definition of softmax.
Let's say I have a task where I need to classify some objects that can fall in several categories:
Real: [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
So I need to normalize in some other way. I need a function which gives values in the range [0, 1] and whose sum can be larger than 1.
I need something like this:
Predicted: [0.1, 0.9, 0.05, 0.9, 0.01, 0.8, 0.1, 0.01, 0.2, 0.9]
Each number is the probability that the object falls in the given category. After that I can use a threshold like 0.5 to decide which categories the object falls in.
The following questions appear:
So which activation function can be used for this?
Maybe this function already exists in Keras?
Maybe you can propose some other way to predict in this case?
Your problem is one of multi-label classification, and in the context of Keras it is discussed, for example, here: https://github.com/fchollet/keras/issues/741
In short, the suggested solution in Keras is to replace the softmax layer with a sigmoid layer and use binary_crossentropy as your cost function.
An example from that thread:
# Build a classifier optimized for maximizing f1_score (uses class_weights)
clf = Sequential()
clf.add(Dropout(0.3))
clf.add(Dense(xt.shape[1], 1600, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(1600, 1200, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(1200, 800, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(800, yt.shape[1], activation='sigmoid'))
clf.compile(optimizer=Adam(), loss='binary_crossentropy')
clf.fit(xt, yt, batch_size=64, nb_epoch=300, validation_data=(xs, ys), class_weight=W, verbose=0)
preds = clf.predict(xs)
preds[preds>=0.5] = 1
preds[preds<0.5] = 0
print(f1_score(ys, preds, average='macro'))
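To see why a sigmoid output fits the multi-label case, here is a small NumPy sketch, independent of Keras. The logits are made-up values chosen for illustration. Softmax couples the outputs so that they sum to 1, while sigmoid squashes each output on its own, so several categories can be near 1 at once and a 0.5 threshold can be applied per category.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the outputs sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    # Applied element-wise; each output lies in (0, 1) independently of the others.
    return 1.0 / (1.0 + np.exp(-z))

# Made-up raw scores for 10 categories (assumption, not from a trained network).
logits = np.array([-2.0, 2.2, -3.0, 2.2, -4.0, 1.4, -2.2, -4.5, -1.4, 2.2])
real = np.array([0, 1, 0, 1, 0, 1, 0, 0, 0, 1])

p_softmax = softmax(logits)
p_sigmoid = sigmoid(logits)

# Threshold each category separately.
labels = (p_sigmoid >= 0.5).astype(int)

# Binary cross-entropy averaged over the categories, the loss paired with sigmoid.
bce = -np.mean(real * np.log(p_sigmoid) + (1 - real) * np.log(1 - p_sigmoid))

print(p_softmax.sum(), p_sigmoid.sum())
print(labels, bce)
```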