How to avoid overfitting with imbalanced data? - python-3.x

I am working on a binary classifier. The data is imbalanced, with class 0 at 83.41% and class 1 at 16.59%. I am using the Matthews correlation coefficient (MCC) to evaluate the performance of the classifier. Also note that the dataset is quite small, with shape (211, 800).
I am using logistic regression to address the problem. I used GridSearchCV for hyperparameter optimisation and came up with the following best hyperparameter values:
Best Params: {'C': 1000, 'class_weight': {1: 0.83, 0: 0.17000000000000004}, 'penalty': 'l1', 'solver': 'liblinear'}
Best MCC 0.7045053547679334
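For reference, a minimal sketch of what such a grid search can look like (the parameter grid below is illustrative, not the exact one used; X and y are the (211, 800) feature matrix and labels):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef

# Illustrative search space; the class weights are candidate re-weightings of the minority class
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'penalty': ['l1', 'l2'],
    'class_weight': [{1: w, 0: 1 - w} for w in [0.5, 0.66, 0.83]],
}
grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                    param_grid,
                    scoring=make_scorer(matthews_corrcoef),
                    cv=5)
grid.fit(X, y)
print('Best Params:', grid.best_params_)
print('Best MCC', grid.best_score_)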
I plotted the validation curve over a range of C values to check whether the model is overfitting or underfitting.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.metrics import make_scorer, matthews_corrcoef

# Range of C values to sweep (assumed here; not shown in the original snippet)
C = np.logspace(-3, 3, 7)

train_scores, test_scores = validation_curve(
    LogisticRegression(penalty='l1',
                       solver='liblinear',
                       class_weight={1: 0.83, 0: 0.17000000000000004}),
    X, y, param_name='C', param_range=C,
    cv=5, scoring=make_scorer(matthews_corrcoef))

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with Logistic Regression")
plt.xlabel("C")
plt.ylabel("MCC")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(C, train_scores_mean, label="Training score",
             color="darkorange", lw=lw)
plt.fill_between(C, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2,
                 color="darkorange", lw=lw)
plt.semilogx(C, test_scores_mean, label="Cross-validation score",
             color="navy", lw=lw)
plt.fill_between(C, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2,
                 color="navy", lw=lw)
plt.legend(loc="best")
plt.show()
Based on my understanding of this curve, the model tends to overfit: it scores high on the training set but low on the validation set. Could anyone point me in some direction as to how to address this on such a small dataset?

You could do a number of things:
Use SMOTE to oversample the minority class (see the sketch after this list).
Reduce the number of parameter combinations searched by GridSearchCV, or use RandomizedSearchCV instead.
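A minimal sketch of the SMOTE option, assuming the imbalanced-learn package and the same X, y as above (oversample only the training split, never the validation or test split):
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first, then oversample only the training portion
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Synthesize minority-class samples so both classes are balanced in the training set
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
With a balanced training set you would then typically drop the heavy class_weight setting from LogisticRegression.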

Related

F1 metric and LeaveOneOut validation strategy in scikit-learn

I want to use GridSearchCV to find the optimal n_neighbors parameter of KNeighborsClassifier, using the 'f1_score' metric AND the 'leave one out' strategy.
But this code
clf = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 2, 3]}, cv=LeaveOneOut(), scoring='f1')
clf.fit(x_train, y_train)
leads to an error
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.
I want to compute the f1 score not for each fold of cross-validation (it is not possible to compute an f1 score on a single test example), but over the whole set of LOO predictions for each value of n_neighbors.
Is it possible using GridSearchCV?
Not sure if this functionality is directly available in scikit-learn, but you can implement the following function to get the desired outcome.
In particular, we will make a dummy scorer which just returns the predicted class instead of computing any score from the ground truth and the prediction. This way we can access the prediction of each hyperparameter combination on each example in the LOO CV.
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

# Dummy "scorer" that simply returns the prediction instead of a score
def get_pred(y_true, y_predicted):
    return y_predicted

get_pred_scorer = make_scorer(get_pred)

clf = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [1, 2, 3]},
    cv=LeaveOneOut(),
    refit=False,
    scoring=get_pred_scorer
)
clf.fit(X_train, y_train)
The problem with this approach is that certain results available in the cv_results_ dictionary (and in certain attributes of GridSearchCV) won't have any meaning, but that probably is not a problem. We should just remember to set refit=False, since with this scorer GridSearchCV has no way to determine the best model.
Now we can access the predictions through cv_results_ and just use f1_score to compute the metric for each hyperparams configuration.
def print_params_f1_scores(clf, y_true):
    y_preds = []  # will contain the predictions of each params combination
    results = clf.cv_results_
    params = results["params"]  # all params combinations
    for j in range(len(params)):  # for each combination
        y_preds.append([])
        for i in range(clf.n_splits_):  # for each split (sample in loo)
            prediction_of_j_on_i = results[f"split{i}_test_score"][j]
            y_preds[j].append(prediction_of_j_on_i)
    # show the f1-scores of each combination
    for j in range(len(y_preds)):
        score = f1_score(y_true, y_preds[j])
        print(f"KNeighborsClassifier with {params[j]} obtained f1-score of {score}")

print_params_f1_scores(clf, y_train)
The function prints the following output:
KNeighborsClassifier with {'n_neighbors': 1} obtained f1-score of 0.94
KNeighborsClassifier with {'n_neighbors': 2} obtained f1-score of 0.94
KNeighborsClassifier with {'n_neighbors': 3} obtained f1-score of 0.92
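As a side note, if you only need the score for one fixed n_neighbors value rather than a full grid search, the per-sample LOO predictions can be collected more directly with cross_val_predict and scored in one go (a minimal sketch, assuming the same X_train and y_train):
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# One prediction per sample, each produced by a model trained on all other samples
preds = cross_val_predict(KNeighborsClassifier(n_neighbors=3),
                          X_train, y_train, cv=LeaveOneOut())
print(f1_score(y_train, preds))
This avoids the dummy scorer entirely, at the cost of looping over the parameter values yourself.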

Change learning rate within minibatch - keras

I have a problem with imbalanced labels: for example, 90% of the data have the label 0 and the remaining 10% have the label 1.
I want to train the network with minibatches, and I want the optimizer to weight the examples labeled 1 (or somehow scale their gradients to be) 9 times greater than those labeled 0.
Is there any way of doing that?
The problem is that the whole training process is done in this line:
history = model.fit(trainX, trainY, epochs=1, batch_size=minibatch_size, validation_data=(valX, valY), verbose=0)
Is there a way to change the fit method at a lower level?
You can try using the class_weight parameter of keras.
From keras doc:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only).
An example of using it with imbalanced data:
https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#class_weights
class_weights = {0: 1., 1: 10.}  # keys are the integer class indices
history = model.fit(trainX, trainY, epochs=1, batch_size=minibatch_size, validation_data=(valX, valY), verbose=0, class_weight=class_weights)
Full example:
import numpy as np

# Examine the class label imbalance
# you can use your_df['label_class_column'] or just the trainY values.
neg, pos = np.bincount(your_df['label_class_column'])
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

# Scaling by total/2 helps keep the loss to a similar magnitude.
# The sum of the weights of all examples stays the same.
weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)
class_weight = {0: weight_for_0, 1: weight_for_1}
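The resulting class_weight dictionary is then passed to model.fit exactly as shown earlier. For completeness, a minimal sketch with a toy Sequential model (the architecture here is only an illustrative assumption):
from tensorflow import keras

# Toy binary classifier; any Keras model is weighted the same way via class_weight
model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(trainX.shape[1],)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(trainX, trainY,
                    epochs=1,
                    batch_size=minibatch_size,
                    validation_data=(valX, valY),
                    class_weight=class_weight,
                    verbose=0)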

Why doesn't GridSearchCV give C with highest AUC when scoring roc_auc in logistic regression

I'm new to this so apologies if this is obvious.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

lr = LogisticRegression(penalty='l1', solver='liblinear')
parameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clf = GridSearchCV(lr, parameters, scoring='roc_auc', cv=5)
clf.fit(X, Y)
print(clf.score(X, Y))
tn, fp, fn, tp = metrics.confusion_matrix(Y, clf.predict(X)).ravel()
print(tn, fp, fn, tp)
I want to run a Logistic Regression, and I'm using an L1 penalty because I want to reduce the number of features used. I'm using GridSearchCV to find the best C value for the Logistic Regression.
I run this and get C = 0.001, AUC = 0.59, Confusion matrix: 46, 0, 35, 0. Only 1 feature has a non-zero coefficient.
I go back to my code and remove the option of C = 0.001 from my parameter list and run it again.
Now I get C = 1, AUC = 0.95, Confusion matrix: 42, 4, 6, 29. Many, but not all, features have a non-zero coefficient.
I thought that since I set scoring to 'roc_auc', shouldn't the model be chosen with the better AUC?
Thinking this might have to do with my l1 penalty, I switched it to l2. But this gave C = 0.001, AUC = 0.80, CM = 42, 4, 16, 19, and again when I removed C = 0.001 as an option it gave C = 0.01, AUC = 0.88, CM = 41, 5, 13, 22.
There is less of an issue with the l2 penalty but it seems to be a pretty big difference in l1. Is it a penalty thing?
From some of my readings I know ElasticNet is supposed to combine some l1 and l2 - is that where I should be looking?
Also, not completely relevant but while I'm posting: I haven't done any data normalization for this. Is that normal for Logistic Regression?
clf.score(X, Y) is the score on the training dataset (the grid search refits the model on the entire dataset after it has chosen the best parameters), so you don't want to use this to evaluate your model. It also isn't what the grid search uses internally in its model selection; instead it uses cross-validated folds and takes the average. You can access the actual score used in the model selection with clf.best_score_.
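A short sketch of the difference, using the variable names from the question:
# Mean cross-validated AUC of the best parameter setting (what the grid search optimises)
print(clf.best_params_, clf.best_score_)

# AUC of the refitted model evaluated on the same data it was trained on
# (optimistic; not a useful estimate of generalisation)
print(clf.score(X, Y))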

Evaluating the performance of one class SVM

I have been trying to evaluate the performance of my one-class SVM. I have tried plotting an ROC curve using scikit-learn, and the results have been a bit bizarre.
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

X_train, X_test = train_test_split(compressed_dataset, test_size=0.5, random_state=42)
clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
y_score = clf.fit(X_train).decision_function(X_test)
pred = clf.predict(X_train)
fpr, tpr, thresholds = roc_curve(pred, y_score)
roc_auc = auc(fpr, tpr)
# Plotting roc curve
plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
The ROC curve I get:
Can somebody help me out with this?
What is bizarre about this plot? You fixed a single pair of nu and gamma values, thus your model is neither over- nor underfitting. Moving the threshold (which is the ROC variable) does not lead to 100% TPR. Try a high gamma and a very small nu (which upper-bounds the training error) and you will get more "typical" plots.
In my opinion, get the scores:
pred_scores = clf.score_samples(X_train)
Then the pred_scores need to be min-max normalized before they are used.
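A minimal sketch of that normalization step (assuming pred_scores is the NumPy array returned above):
# Min-max scale the raw scores into the [0, 1] range
s_min, s_max = pred_scores.min(), pred_scores.max()
pred_scores_scaled = (pred_scores - s_min) / (s_max - s_min)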

Very few distinct prediction probabilities for CV instances with sparse SVM

I'm having an issue using the prediction probabilities from a sparse SVM, where many of the predictions come out the same for my test instances. These probabilities are produced during cross-validation, and when I plot an ROC curve for the folds, the results look very strange, as there are a handful of clustered points on the graph. Here is my cross-validation code, based on the examples on the scikit-learn website:
import numpy as np
from sklearn import svm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

# accumulators for the averaged ROC curve (assumed; not shown in the original snippet)
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)

skf = StratifiedKFold(n_splits=numfolds)
for fold, (train_index, test_index) in enumerate(skf.split(X_scaled, y)):
    # split the training and testing sets
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # train on the subset for this fold
    print('Training on fold ' + str(fold))
    classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val, probability=True)
    probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
I'm just trying to figure out if there's something I'm obviously missing here, since I used this same training set and the same SVM parameters with libsvm and got much better results. When I used libsvm, printed out the distances from the hyperplane for the CV test instances, and then plotted the ROC, it came out much more like I expected, with a much better AUC.
