Grid search and XGBClassifier using class weights - scikit-learn

I am trying to use scikit-learn GridSearchCV together with XGBoost XGBClassifier wrapper for my unbalanced multi-class classification problem. So far I have used a list of class weights as an input for the scale_pos_weight argument, but this does not seem to work as all my predictions are for the majority class. This is probably because in the documentation of the XGBClassifier it is mentioned that scale_pos_weight can only be used for binary classification problems.
So my question is, how can I input sample/class weights for a multi-class classification task using scikit-learn GridSearchCV?
My code is below:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight('balanced', np.unique(training_targets),
training_targets[target_label[0]])
random_state = np.random.randint(0, 1000)
parameters = {
'max_depth': [3, 4, 5],
'learning_rate': [0.1, 0.2, 0.3],
'n_estimators': [50, 100, 150],
'gamma': [0, 0.1, 0.2],
'min_child_weight': [0, 0.5, 1],
'max_delta_step': [0],
'subsample': [0.7, 0.8, 0.9, 1],
'colsample_bytree': [0.6, 0.8, 1],
'colsample_bylevel': [1],
'reg_alpha': [0, 1e-2, 1, 1e1],
'reg_lambda': [0, 1e-2, 1, 1e1],
'base_score': [0.5]
}
xgb_model = xgb.XGBClassifier(scale_pos_weight = class_weights, silent = True,
random_state = random_state)
clf = GridSearchCV(xgb_model, parameters, scoring = 'f1_micro', n_jobs = -1, cv = 5)
clf.fit(training_features, training_targets.values[:, 0])
model = clf.best_estimator_

The scale_pos_weight is only for binary classification, so it won't work on multi-label classification tasks.
For your case, it's more advisable to use the weight parameter as described here (https://xgboost.readthedocs.io/en/latest/python/python_api.html). The argument will be an array which each element represents the weight you assigned for the corresponding data point.
The idea is essentially to manually assign different weights to different classes. There's no standard in how you need to assign weights, it's more up to your decision. The more weight a sample is being assigned, the more it affects the objective function during the training.
However, if you use the scikit learn API format, you cannot specify the weight parameter nor using the DMAtrix format. Thankfully, xgboost has its own cross validation function, which you can find details here: https://xgboost.readthedocs.io/en/latest/python/python_api.html

I suggest that you use the compute_sample_weight() function and set weights for each sample by looking at your labels. This will solve your problem in the most elegant way. See below for 3 classes (-1,0,1):
sample_weights=compute_sample_weight({-1:4,0:1,1:4},Train_Labels)
random_search = RandomizedSearchCV(model, param_distributions=params, n_iter=param_comb,return_train_score=True, scoring=score,cv=ps, n_jobs=-1, verbose=3, random_state=1001 )
random_search.fit(Train,Train_Labels,sample_weight=sample_weights)

In a multi-class setup we need to pass sample_weight parameter with a list of values (weights) matching the count of data-points (for example number of rows in X_train), to fit() of XGBoostClassifier. Check the docs.
While using XGBoostClassifier with scikit-learn GridSearchCV, you can pass sample_weight directly to the fit() of GridSearchCV.
Note: Tried in scikit-learn version 1.1.1. Not sure from which version onwards this is supported.
For example:
def get_weights(cls):
class_weights = {
# class-labels based on your dataset.
0: 1,
1: 4,
2: 1,
}
return [class_weights[cl] for cl in cls]
grid = {
"max_depth": [3, 4, 5, 6],
"n_estimators": range(20, 70, 10),
"learning_rate": np.arange(0.25, 0.50, 0.05),
}
xgb_clf = XGBClassifier(random_state=42, n_jobs=-1)
xgb_cvm = GridSearchCV(estimator=xgb_clf, param_grid=grid, n_jobs=-1, cv=5)
xgb_cvm.fit(X, y, sample_weight=get_weights(y))

Related

Sklearn gridsearchcv score not matching

Variable names for my training and test data are X_train, X_test, Y_train, Y_test
I have ran a GridSearchCV instance from sklearn to do hyperparameter tuning for my random forest model.
param_grid = {
'n_estimators': [500],
'max_features': ['sqrt', None],
'max_depth': [ 6 ],
'max_leaf_nodes': [8],
'min_impurity_decrease':[0,0.02],
'min_samples_split':[2]
}
grid_search= GridSearchCV(RandomForestClassifier(
criterion='gini',
min_weight_fraction_leaf=0.0,
bootstrap=True,
n_jobs=-1,
random_state=1, verbose=0,
warm_start=False, class_weight='balanced',
ccp_alpha=0.0,
max_samples=None),
param_grid=param_grid,verbose=50,cv=2,n_jobs=-1,scoring='balanced_accuracy')
grid_search.fit(X_train, Y_train)
All the scores that I can see while the gridseach is training are in the range of 0.4-.6
Following is the output of the best score:
[CV 2/2; 1/4] END max_depth=6, max_features=sqrt, max_leaf_nodes=8, min_impurity_decrease=0, min_samples_split=2, n_estimators=500;, score=0.552 total time= 15.4s
My questions is when I am manually calculating balanced_accuracy using
from sklearn.metrics import balanced_accuracy_score, by running print('training accuracy', balanced_accuracy_score(grid_search.predict(X_train), Y_train,adjusted=False)), I am getting a value of about 0.96 which is very different from what the output of gridsearchcv is showing during the run. Why is this so? And what does the score in gridsearchcv mean then? Please note I have passed the parameter scoring = 'balanced_accuracy' in gridsearchcv to make sure they calculate the same thing.
The score you get from gridsearchcv is the validation score (score measured on the part of X_train not used to train the model).
Your manually calculated score is the training score (you fit the model and evaluate the score on the same data: X_train).
The high difference is a sign of overfitting.
You can try to change param_grid :
param_grid = {
'n_estimators': [100, 200], # 500 seems high and might take too long for no reason
'max_features': ['sqrt','log2', None], # Less features can reduce overfitting
'max_depth': [3, 4, 5, 6 ], # Lower depth can reduce overfitting
'max_leaf_nodes': [4, 6, 8], # Lower max_leaf_nodes can reduce overfitting
'min_impurity_decrease':[0,0.02],
'min_samples_split':[2],
'min_samples_leaf': [5, 10, 20] # Higher values can reduce overfitting
}
Also using cv=3 or cv=5 in GridSearchCV could help.
See this post about solving Random Forest overfitting.

Isolation Forest Parameter tuning with gridSearchCV

I have multi variate time series data, want to detect the anomalies with isolation forest algorithm.
want to get best parameters from gridSearchCV, here is the code snippet of gridSearch CV.
input data set loaded with below snippet.
df = pd.read_csv("train.csv")
df.drop(['dataTimestamp','Anomaly'], inplace=True, axis=1)
X_train = df
y_train = df1[['Anomaly']] ( Anomaly column is labelled data).
define the parameters for Isolation Forest.
clf = IsolationForest(random_state=47, behaviour='new', score="accuracy")
param_grid = {'n_estimators': list(range(100, 800, 5)), 'max_samples': list(range(100, 500, 5)), 'contamination': [0.1, 0.2, 0.3, 0.4, 0.5], 'max_features': [5,10,15], 'bootstrap': [True, False], 'n_jobs': [5, 10, 20, 30]}
f1sc = make_scorer(f1_score)
grid_dt_estimator = model_selection.GridSearchCV(clf, param_grid,scoring=f1sc, refit=True,cv=10, return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)
after executing the fit , got the below error.
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
Can some one guide me what is this about, tried average='weight', but still no luck, anything am doing wrong here.
please let me know how to get F-score as well.
You incur in this error because you didn't set the parameter average when transforming the f1_score into a scorer. In fact, as detailed in the documentation:
average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’,
‘samples’, ‘weighted’] This parameter is required for
multiclass/multilabel targets. If None, the scores for each class are
returned.
The consequence is that the scorer returns multiple scores for each class in your classification problem, instead of a single measure. The solution is to declare one of the possible values of the average parameter for f1_score, depending on your needs. I therefore refactored the code you provided as an example in order to provide a possible solution to your problem:
from sklearn.ensemble import IsolationForest
from sklearn.metrics import make_scorer, f1_score
from sklearn import model_selection
from sklearn.datasets import make_classification
X_train, y_train = make_classification(n_samples=500,
n_classes=2)
clf = IsolationForest(random_state=47, behaviour='new')
param_grid = {'n_estimators': list(range(100, 800, 5)),
'max_samples': list(range(100, 500, 5)),
'contamination': [0.1, 0.2, 0.3, 0.4, 0.5],
'max_features': [5,10,15],
'bootstrap': [True, False],
'n_jobs': [5, 10, 20, 30]}
f1sc = make_scorer(f1_score(average='micro'))
grid_dt_estimator = model_selection.GridSearchCV(clf,
param_grid,
scoring=f1sc,
refit=True,
cv=10,
return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)
Update make_scorer with this to get it working.
make_scorer(f1_score, average='micro')
Parameters you tune are not all necessary.
For example:
contamination is the rate for abnomaly, you can determin the best value after you fitted a model by tune the threshold on model.score_samples
n_jobs is the CPU core you used.

GridSearch for MultiClass KerasClassifier

I am trying to do a grid search for a multiclass classification with Keras. Here is a section of the code:
Some properties of the data are below:
y_
array(['fast', 'immobile', 'immobile', ..., 'slow',
'immobile', 'slow'],
dtype='<U17')
y_onehot = pd.get_dummies(y_).values
y_onehot
array([[1, 0, 0],
[0, 0, 1],
[0, 0, 1],
...
[0, 1, 0],
[0, 0, 1],
[0, 1, 0]], dtype=uint8)
#Do train-test split
y_train.shape
(1904,)
y_train_onehot.shape
(1904, 3)
And the model...
# Function to create model, required for KerasClassifier
def create_model(optimizer='rmsprop', init='glorot_uniform'):
# create model
model = Sequential()
model.add(Dense(2048, input_dim=X_train.shape[1], kernel_initializer=init, activation='relu'))
model.add(Dense(512, kernel_initializer=init, activation='relu'))
model.add(Dense(y_train_onehot.shape[1], kernel_initializer=init, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
return model
# create model
model = KerasClassifier(build_fn=create_model, verbose=0)
# grid search epochs, batch size and optimizer
optimizers = ['rmsprop', 'adam']
init = ['glorot_uniform', 'normal', 'uniform']
epochs = [50, 100, 150]
batches = [5, 10, 20]
param_grid = dict(optimizer=optimizers, epochs=epochs, batch_size=batches, init=init)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy')
grid_result = grid.fit(X_train, y_train_onehot)
And here is the error:
--> grid_result = grid.fit(X_train, y_train_onehot)
ValueError: Classification metrics can't handle a mix of multilabel-indicator and multiclass targets
The code was for a binary model but I am hoping to modify it for a multiclass data set. Kindly assist. Thanks!
The error is in the softmax layer.
I think you mean y_train_onehot.shape[1] instead of y_train_onehot[1]
Update 1: This is strange but your second problem seems to be y_train_onehot, would you mind to try 2 things:
try the same model without the onehot encoding on y_train.
if that alone doesn't work, change the loss to sparse_categorical_crossentropy
Also make sure to change y_train_onehot.shape[1] to the number of classes in the softmax layer

What is the replace for softmax layer in case more than one output can be activated?

For example, I have CNN which tries to predict numbers from MNIST dataset (code written using Keras). It has 10 outputs, which form softmax layer. Only one of outputs can be true (independently for each digit from 0 to 9):
Real: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
Predicted: [0.02, 0.9, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
Sum of predicted is equal to 1.0 due to definition of softmax.
Let's say I have a task where I need to classify some objects that can fall in several categories:
Real: [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
So I need to normalize in some other way. I need function which gives value on range [0, 1] and which sum can be larger than 1.
I need something like that:
Predicted: [0.1, 0.9, 0.05, 0.9, 0.01, 0.8, 0.1, 0.01, 0.2, 0.9]
Each number is probability that object falls in given category. After that I can use some threshold like 0.5 to distinguish categories in which given object falls.
The following questions appear:
So which activation function can be used for this?
May be this function already exists in Keras?
May be you can propose some other way to predict in this case?
Your problem is one of multi-label classification, and in the context of Keras it is discussed, for example, here: https://github.com/fchollet/keras/issues/741
In short the suggested solution for it in keras is to replace the softmax layer with a sigmoid layer and use binary_crossentropy as your cost function.
an example from that thread:
# Build a classifier optimized for maximizing f1_score (uses class_weights)
clf = Sequential()
clf.add(Dropout(0.3))
clf.add(Dense(xt.shape[1], 1600, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(1600, 1200, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(1200, 800, activation='relu'))
clf.add(Dropout(0.6))
clf.add(Dense(800, yt.shape[1], activation='sigmoid'))
clf.compile(optimizer=Adam(), loss='binary_crossentropy')
clf.fit(xt, yt, batch_size=64, nb_epoch=300, validation_data=(xs, ys), class_weight=W, verbose=0)
preds = clf.predict(xs)
preds[preds>=0.5] = 1
preds[preds<0.5] = 0
print f1_score(ys, preds, average='macro')

How to calculate F1-micro score using lasagne

import theano.tensor as T
import numpy as np
from nolearn.lasagne import NeuralNet
def multilabel_objective(predictions, targets):
epsilon = np.float32(1.0e-6)
one = np.float32(1.0)
pred = T.clip(predictions, epsilon, one - epsilon)
return -T.sum(targets * T.log(pred) + (one - targets) * T.log(one - pred), axis=1)
net = NeuralNet(
# your other parameters here (layers, update, max_epochs...)
# here are the one you're interested in:
objective_loss_function=multilabel_objective,
custom_score=("validation score", lambda x, y: np.mean(np.abs(x - y)))
)
I found this code online and wanted to test it. It did work, the results include training loss, test loss, validation score and during time and so on.
But how can I get the F1-micro score? Also, if I was trying to import scikit-learn to calculate the F1 after adding the following code:
data = data.astype(np.float32)
classes = classes.astype(np.float32)
net.fit(data, classes)
score = cross_validation.cross_val_score(net, data, classes, scoring='f1', cv=10)
print score
I got this error:
ValueError: Can't handle mix of multilabel-indicator and
continuous-multioutput
How to implement F1-micro calculation based on above code?
Suppose your true labels on the test set are y_true (shape: (n_samples, n_classes), composed only of 0s and 1s), and your test observations are X_test (shape: (n_samples, n_features)).
Then you get your net predicted values on the test set by y_test = net.predict(X_test).
If you are doing multiclass classification:
Since in your network you have set regression to False, this should be composed of 0s and 1s only, too.
You can compute the micro averaged f1 score with:
from sklearn.metrics import f1_score
f1_score(y_true, y_pred, average='micro')
Small code sample to illustrate this (with dummy data, use your actual y_test and y_true):
from sklearn.metrics import f1_score
import numpy as np
y_true = np.array([[0, 0, 1], [0, 1, 0], [0, 0, 1], [0, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1]])
t = f1_score(y_true, y_pred, average='micro')
If you are doing multilabel classification:
You are not outputting a matrix of 0 and 1, but a matrix of probabilities. y_pred[i, j] is the probability that observation i belongs to the class j.
You need to define a threshold value, above which you will say an observation belongs to a given class. Then you can attribute labels accordingly and proceed just the same as in the previous case.
thresh = 0.8 # choose your own value
y_test_binary = np.where(y_test > thresh, 1, 0)
# creates an array with 1 where y_test>thresh, 0 elsewhere
f1_score(y_true, y_pred_binary, average='micro')

Resources