How to reduce false positives in xgboost? - python-3.x

My dataset is evenly split between classes 0 and 1: 100,000 data points in total, with 50,000 labeled 0 and 50,000 labeled 1. I did an 80/20 train/test split and got a 98% accuracy score. However, the confusion matrix shows an awful lot of false positives. I'm new to xgboost and decision trees in general. Which settings can I change in the XGBClassifier to reduce the number of false positives, or is that even possible? Thank you.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0, stratify=y) # 80% training and 20% test
model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                          colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
                          importance_type='gain', interaction_constraints='',
                          learning_rate=0.1, max_delta_step=0, max_depth=9,
                          min_child_weight=1, missing=None, monotone_constraints='()',
                          n_estimators=180, n_jobs=4, num_parallel_tree=1, random_state=0,
                          reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
                          tree_method='exact', use_label_encoder=False,
                          validate_parameters=1, verbosity=None)
model.fit(X_train,
          y_train,
          verbose=True,
          early_stopping_rounds=10,
          eval_metric="aucpr",
          eval_set=[(X_test, y_test)])
plot_confusion_matrix(model,
                      X_test,
                      y_test,
                      values_format='d',
                      display_labels=['Old Forests', 'Not Old Forests'])

Yes, it is possible.
If you are looking for a simple fix, lower the value of scale_pos_weight. This lowers the false positive rate even though your dataset is balanced.
For a more robust fix, run a hyperparameter tuning search. In particular, try different values of scale_pos_weight, alpha (reg_alpha), lambda (reg_lambda), gamma and min_child_weight, since these have the most impact on how conservative the model is going to be.
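To get a feel for why a lower scale_pos_weight pushes the model away from predicting the positive class, here is a rough sketch of the simplest possible case: a model that outputs one constant probability p for every row, trained with a log loss in which positive rows are weighted by w. Setting the derivative to zero gives p = w*n_pos / (w*n_pos + n_neg). The helper name is made up for illustration and is not part of XGBoost; a real boosted model is far more flexible, so this only shows the direction of the effect.

```python
def optimal_constant_probability(n_pos, n_neg, pos_weight):
    """Minimiser of the positive-weighted log loss for a constant prediction.

    Hypothetical helper for illustration only; not an XGBoost function.
    """
    weighted_pos = pos_weight * n_pos
    return weighted_pos / (weighted_pos + n_neg)


# Class counts from the question: 50,000 rows per class.
for w in (1.0, 0.5, 0.25):
    p = optimal_constant_probability(50_000, 50_000, w)
    print(f"positive weight {w}: best constant probability = {p:.3f}")
```

With a weight below 1 the predicted probabilities shift downward, so fewer rows cross the 0.5 cut-off. That means fewer false positives, at the cost of more false negatives.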

Related

InceptionV3 transfer learning with Keras overfitting too soon

I'm using a pre-trained InceptionV3 in Keras and retraining it for binary image classification (data labeled with 0's and 1's).
I'm reaching about 65% accuracy in k-fold validation on never-seen data, but the problem is that the model is overfitting too soon. I need to improve this average accuracy, and I suspect the overfitting is related.
Here are the loss values on epochs:
Here is the code. The dataset and label variables are Numpy Arrays.
dataset = joblib.load(path_to_dataset)
labels = joblib.load(path_to_labels)
le = LabelEncoder()
labels = le.fit_transform(labels)
labels = to_categorical(labels, 2)
X_train, X_test, y_train, y_test = sk.train_test_split(dataset, labels, test_size=0.2)
X_train, X_val, y_train, y_val = sk.train_test_split(X_train, y_train, test_size=0.25) # 0.25 x 0.8 = 0.2
X_train = np.array(X_train)
y_train = np.array(y_train)
X_val = np.array(X_val)
y_val = np.array(y_val)
X_test = np.array(X_test)
y_test = np.array(y_test)
aug = ImageDataGenerator(
    rotation_range=20,
    zoom_range=0.15,
    horizontal_flip=True,
    fill_mode="nearest")
pre_trained_model = InceptionV3(input_shape = (299, 299, 3),
                                include_top = False,
                                weights = 'imagenet')
for layer in pre_trained_model.layers:
    layer.trainable = False
x = layers.Flatten()(pre_trained_model.output)
x = layers.Dense(1024, activation = 'relu')(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(2, activation = 'softmax')(x) #already tried with sigmoid activation, same behavior
model = Model(pre_trained_model.input, x)
model.compile(optimizer = RMSprop(lr = 0.0001),
              loss = 'binary_crossentropy',
              metrics = ['accuracy']) #Already tried with Adam optimizer, same behavior
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=100)
mc = ModelCheckpoint('best_model_inception_rmsprop.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)
history = model.fit(x=aug.flow(X_train, y_train, batch_size=32),
                    validation_data = (X_val, y_val),
                    epochs = 100,
                    callbacks=[es, mc])
The training dataset has 2181 images and validation has 727 images.
Something is wrong, but I can't tell what...
Any thoughts of what can be done to improve it?
One way to avoid overfitting is to use a lot of data. Overfitting mainly happens when you try to learn from a small dataset: the algorithm has enough capacity to fit every data point in it exactly. With a large number of data points, the algorithm is forced to generalize and come up with a model that suits most of the points.
Suggestions:
Use a lot of data.
Use a less deep network if you have a small number of data samples.
If you do use a smaller network, don't train for a huge number of epochs. Training for many epochs forces the model to memorize the training data; it will learn it well but will not generalize.
From your loss graph, I see that the model generalizes best at an early epoch (where the train and validation curves intersect), so please try to use the model saved at that epoch, not the later epochs, which seem to overfit.
Your second option is to use a lot more training samples.
If you have only a small number of training samples, use data augmentation.
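The "use the model from the early epoch" advice comes down to patience logic, which can be sketched without Keras. As far as I can tell, patience=100 with epochs=100 in the question means EarlyStopping can never cut training short, so a much smaller patience on val_loss would be needed for it to have any effect. The function name and the loss values below are made up for illustration.

```python
def early_stop(val_losses, patience):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    Epochs are 0-based. Training stops once `patience` consecutive epochs
    fail to improve on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    # Patience was never exhausted: training ran to the last epoch.
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: improve for three epochs, then overfit.
losses = [0.70, 0.62, 0.58, 0.60, 0.65, 0.71, 0.80]
print(early_stop(losses, patience=2))    # stops shortly after the best epoch
print(early_stop(losses, patience=100))  # never stops early
```

In both cases the weights worth keeping are those from best_epoch, which is what a checkpoint monitored on validation loss would save.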
Have you tried the following?
Using a higher dropout value
Using a lower learning rate (lr=0.00001 or lr=0.000001 ...)
Using more data augmentation
It seems to me that you have little data. You could use smaller test and validation ratios (10% and 10%).

Models evaluation and parameter tuning with CV

I am trying to compare three models: SVM, RandomForest and LogisticRegression.
I have an imbalanced dataset. First I split it into train and test sets with an 80%/20% ratio, setting stratify=y.
Next, I used StratifiedKFold on the train set only. What I am trying to do now is fit the models and choose the best one. I also want to use grid search for each of the models to find the best parameters.
My code so far is:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)
for train_index, test_index in skf.split(X_train, y_train):
    X_train_folds, X_test_folds = X_train[train_index], X_train[test_index]
    y_train_folds, y_test_folds = y_train[train_index], y_train[test_index]
    # Note: the fold indices refer to X_train, so indexing the full X and y with them is probably unintended
    X_train_2, X_test_2, y_train_2, y_test_2 = X[train_index], X[test_index], y[train_index], y[test_index]
How can I fit a model using all the folds? How can I grid search? Should I have a double loop? Can you help?
You can use scikit-learn's GridSearchCV.
You will find an example here of how to evaluate the performance of the various models and assess the statistical significance of the results.
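A minimal sketch of what that could look like for the three models in the question. GridSearchCV runs the fold loop itself, so no double loop is needed. The data here is synthetic (make_classification stands in for the asker's X and y), the parameter grids and the F1 scorer are illustrative choices only, and the scaler is placed inside a Pipeline so it is refit on each training fold instead of on the whole dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the asker's X and y.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=21)

# One (estimator, grid) pair per model; the grids are illustrative only.
candidates = {
    "svm": (SVC(), {"clf__C": [0.1, 1, 10]}),
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"clf__n_estimators": [25, 50]}),
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1, 10]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=skf, scoring="f1")
    search.fit(X_train, y_train)  # cross-validates every grid point on the train set
    results[name] = search
    print(name, search.best_params_, round(search.best_score_, 3))

# Pick the model with the best cross-validated score, then check it once on the test set.
best_name = max(results, key=lambda n: results[n].best_score_)
print("best model:", best_name)
print("held-out F1:", results[best_name].score(X_test, y_test))
```

The test set is touched only once, after model selection, so its score stays an honest estimate.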

Model performance is "Good". But coefficient weightings are strange

I am training a model to detect Good/Bad clients. My input features are:
'Net Receivables', 'Sales', 'Cost of Goods sold', 'Current Assets',
'Property, plant and equipment', 'Securities', 'Total assets',
'Depreciation', 'Selling, General & Administrative Expense',
'Total long term debt', 'Current Liabilites', 'Net Receivables.1',
'Sales.1', 'Cost of Goods sold.1', 'Current Assets.1',
'Property, plant and equipment.1', 'Securities.1', 'Total assets.1',
'Depreciation.1', 'Selling, General & Administrative Expense.1',
'Total long term debt.1', 'Current Liabilites.1',
'Income from Continuing Operations', 'Cash Flows from Operations'
I trained a simple model using Logistic Regression:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
Then I try to evaluate the model using AUC and accuracy
print(roc_auc_score(y_test, pred))
print(accuracy_score(y_test, pred))
The result is
0.765625
0.7727272727272727
But when I try to evaluate the feature importance by
odds = np.exp(clf.coef_[0])
I found some strange coefficients. It seems that no features are relatively more significant
array([1.00000001, 1.00000035, 0.99999963, 0.99999987, 0.99999928,
1. , 1. , 0.99999993, 1.00000019, 0.9999994 ,
0.99999976, 1.00000016, 0.99999996, 1.00000003, 0.99999967,
0.99999967, 1. , 1.00000035, 0.99999995, 0.99999985,
1.00000035, 1.00000021, 1.00000008, 1.00000051])
My training set is relatively small: 174 rows * 24 features.
Can I trust the score of the model?
Why do you use np.exp?
And why do you use coef_[0]? The normal approach to get the coefficients of your logistic regression is:
print(clf.coef_, clf.intercept_)
followed also by this post.
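One possible reading of the near-1 odds ratios, assuming the financial features are in raw currency units (the question does not say): exp(coef) is the odds ratio for a one-unit increase in the feature, so a change of one unit in something like 'Total assets' barely moves the odds even if the feature matters. The sketch below takes the largest printed value (1.00000051) and shows what the same coefficient would mean per million units. The rescaling factor is hypothetical, and the proportionality is exact only if regularization is ignored.

```python
import math

# Largest odds ratio printed in the question, for a one-unit feature change.
odds_per_unit = 1.00000051
coef_per_unit = math.log(odds_per_unit)

# Hypothetical rescaling: express the same feature in millions of units.
# In a linear model, multiplying the unit by k multiplies the coefficient by k.
scale = 1_000_000
coef_per_million = coef_per_unit * scale
odds_per_million = math.exp(coef_per_million)

print(f"coefficient per unit: {coef_per_unit:.3e}")
print(f"odds ratio per unit: {odds_per_unit:.8f}")
print(f"odds ratio per {scale:,} units: {odds_per_million:.3f}")
```

If that is what is going on, standardizing the features before fitting would make the coefficients comparable with each other.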

Random subsets of a dataset

I would like to compare the classification performance (accuracy) of different classifiers (e.g. CNN, SVM, ...) depending on the size of the training data set.
Given a dataset of images (e.g. MNIST), 80% of the images are selected at random while preserving class balance. Then 80% of that subset is selected in the same way to form the next smaller subset. This is repeated until a small training set of about 1000 images is reached.
Each classifier should then be trained on each of these subsets.
The aim is to be able to make a statement such as: from a training size of 5000 images, classifier A is significantly better than classifier B.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size= 0.2, stratify=y)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train, random_state=0, test_size= 0.2, stratify=y_train)
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_train_2, y_train_2, random_state=0, test_size= 0.2, stratify=y_train_2) # keep 80% again, as described above
.....
.....
.....
My problem is that I am not sure whether this is really random sampling when I use the above code. Would it be better to draw the subsets some other way, e.g. using numpy.random.randint?
For any help, I would be very grateful.
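As far as I know, train_test_split shuffles by default, so each call draws a random sample, and stratify keeps the class proportions. numpy.random.randint, by contrast, draws indices with replacement, so the same image could end up in a subset more than once. A minimal sketch of the repeated 80% scheme as a loop, on synthetic data standing in for the images (the sizes and the stopping rule are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an image dataset: 10 balanced classes, 5,000 samples.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(10), 500)
X = rng.normal(size=(y.size, 4))

subsets = []
X_cur, y_cur = X, y
while True:
    # Keep a stratified random 80% of the current subset.
    X_cur, _, y_cur, _ = train_test_split(
        X_cur, y_cur, train_size=0.8, stratify=y_cur, random_state=0)
    subsets.append((X_cur, y_cur))
    if len(y_cur) * 0.8 < 1000:  # the next subset would drop below about 1000
        break

for _, y_sub in subsets:
    print(len(y_sub), np.bincount(y_sub))  # subset size and per-class counts
```

Each subset is nested inside the previous one, which is what the description asks for. A fixed random_state only makes the draw reproducible; it is still a random permutation.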

Why is my detection score high in spite of obvious misclassifications during prediction?

I am working on an intrusion classification problem using the NSL-KDD dataset. I used 10 features (out of 42) for training, selected by recursive feature elimination with a Random Forest classifier as the estimator and the Gini index as the splitting criterion. After training the classifier, I use the same classifier to predict the classes of the test data. My cross-validation scores (accuracy, precision, recall, F-score) from sklearn's cross_val_score were above 99% for all four metrics. But the confusion matrix shows otherwise, with high counts of false positives and false negatives. Clearly these do not match the accuracy and the other scores. Where did I go wrong?
# Train set contain X_train (dataframe of features) and Y_train (series of target labels)
# Test set contain X_test and Y_test
# Classifier variable
clf = RandomForestClassifier(n_estimators = 10, criterion = 'gini')
# Training
clf.fit(X_train, Y_train)
# Testing
Y_pred = clf.predict(X_test)
pandas.crosstab(Y_test, Y_pred, rownames = ['Actual'], colnames = ['Predicted'])
# Scoring
accuracy = cross_val_score(clf, X_test, Y_test, cv = 10, scoring = 'accuracy')
print("Accuracy: %0.5f (+/- %0.5f)" % (accuracy.mean(), accuracy.std() * 2))
precision = cross_val_score(clf, X_test, Y_test, cv = 10, scoring = 'precision_weighted')
print("Precision: %0.5f (+/- %0.5f)" % (precision.mean(), precision.std() * 2))
recall = cross_val_score(clf, X_test, Y_test, cv = 10, scoring = 'recall_weighted')
print("Recall: %0.5f (+/- %0.5f)" % (recall.mean(), recall.std() * 2))
f = cross_val_score(clf, X_test, Y_test, cv = 10, scoring = 'f1_weighted')
print("F-Score: %0.5f (+/- %0.5f)" % (f.mean(), f.std() * 2))
I got accuracy, precision, recall and f-score of
Accuracy 0.99825
Precision 0.99826
Recall 0.99825
F-Score 0.99825
However, the confusion matrix showed otherwise
Predicted 9670 41
Actual 5113 2347
Am I training the whole thing wrong or is it just misclassification problem from poor feature selection?
Your predicted values are stored in Y_pred.
accuracy_score(Y_test, Y_pred)
Just check whether this works...
You are not comparing equivalent results! For the confusion matrix, you train on (X_train, Y_train) and test on (X_test, Y_test).
However, cross_val_score fits the estimator on k-1 folds of (X_test, Y_test) and tests it on the remaining fold of (X_test, Y_test), because cross_val_score does its own cross-validation (with 10 folds here) on the dataset you provide. Check the cross_val_score documentation for more explanation.
So basically, you don't fit and test your algorithm on the same data in the two cases. This might explain the inconsistency in the results.
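To see the size of the mismatch, the hold-out accuracy can be recomputed directly from the confusion matrix in the question. This sketch assumes the rows are the actual classes and the columns the predicted classes, as the crosstab call suggests; the class names are not shown in the question, so the classes are just called first and second here.

```python
# Counts copied from the crosstab in the question, assuming rows are the
# actual classes and columns the predicted classes (class names unknown).
confusion = [[9670, 41],
             [5113, 2347]]

correct = confusion[0][0] + confusion[1][1]
total = sum(sum(row) for row in confusion)
holdout_accuracy = correct / total

# Recall of the second class: correctly predicted / all actual members.
recall_second = confusion[1][1] / sum(confusion[1])

print(f"hold-out accuracy: {holdout_accuracy:.4f}")
print(f"recall, second class: {recall_second:.4f}")
```

That comes to roughly 70% accuracy for the model trained on the train set, against the 99.8% reported by cross_val_score, which was fitted and scored on folds of the test set only.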
