Imbalanced dataset using MLP classifier in python

I am dealing with an imbalanced dataset and I am trying to build a predictive model using MLPClassifier. Unfortunately, the algorithm classifies all observations from the test set as class "1", and hence the F1 score and recall values in the classification report are 0. Does anyone know how to deal with this?
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score, classification_report

model = MLPClassifier(solver='lbfgs', activation='tanh')
model.fit(X_train, y_train)
score = accuracy_score(y_test, model.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
cr = classification_report(y_test, model.predict(X_test))

There are a few techniques for handling an imbalanced dataset. A fully dedicated Python library, imbalanced-learn, is available for this purpose, but one should be cautious about which technique to use in a specific case.
Some interesting examples are also available at https://svds.com/learning-imbalanced-classes/
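As a minimal sketch of one such technique (random oversampling from imbalanced-learn, applied to the MLPClassifier above; SMOTE could be swapped in the same way), note that only the training data is resampled:

from imblearn.over_sampling import RandomOverSampler
from sklearn.neural_network import MLPClassifier

# Resample the training data only; the test set must stay untouched.
ros = RandomOverSampler(random_state=42)
# fit_resample in recent imbalanced-learn versions (fit_sample in older releases)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

model = MLPClassifier(solver='lbfgs', activation='tanh')
model.fit(X_train_res, y_train_res)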

Related

Does it make sense to use scikit-learn cross_val_predict() to (i) make predictions with unseen data in k-fold cross-validation and (ii) compare models?

I'm training and evaluating a logistic regression model and an XGBoost classifier.
With the XGBoost classifier, a training/validation/test split of the data and the subsequent training and validation show that the model is overfitting the training data. So, I'm working with k-fold cross-validation to reduce overfitting.
To work with k-fold cross-validation, I'm splitting my data into training and test sets and performing the k-fold cross-validation on the training set. The code looks something like the following:
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = XGBClassifier()
kfold = StratifiedKFold(n_splits=10)
results = cross_val_score(model, x_train, y_train, cv=kfold)
The code works. Now, I've read several forums and blogs on how to make predictions after a k-fold cross-validation, but after these readings, I'm still not sure about the proper way of doing the predictions.
It would seem that using the cross_val_predict() method from sklearn.model_selection and using the test set is OK. The code would look something like the following:
y_pred = cross_val_predict(model, x_test, y_test, cv = kfold)
The code works, but the issue is whether this makes sense, since I've seen more complicated ways of doing it where it isn't clear whether the training or the test set should be used for the predictions.
And if this makes sense, computing the accuracy score and the confusion matrix would be as simple as running something like the following:
accuracy = metrics.accuracy_score(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
These two would help compare the logistic regression and the XGBoost classifier. Does this way of making predictions and evaluating models make sense?
Any help is appreciated! Thanks!
I want to answer my own question by summarizing things I have read and tried.
First, I want to clarify that the idea behind splitting my data into training/test sets and performing the k-fold cross-validation on the training set is to reserve the test set for providing a generalization error in much the same way we split data into training/validation/test sets and use the test set for providing a generalization error. For the sake of clarity, let me split the discussion into 2 sections.
Section 1
Now, after reading more, it's clearer to me that cross_val_predict() returns the predictions that were obtained during the cross-validation when the elements were in a test set (see section 3.1.1.2 in this scikit-learn cross-validation doc). This test set refers to one of the test sets the cross-validation procedure internally creates (cross-validation creates a test set in each fold). Thus:
y_pred = cross_val_predict(model, x_train, y_train, cv = kfold)
returns the predictions from the cross-validation internal test sets. It then seems safe to obtain the accuracy and confusion matrix with:
accuracy = metrics.accuracy_score(y_train, y_pred)
cm = metrics.confusion_matrix(y_train, y_pred)
While cross_val_predict(model, x_test, y_test, cv = kfold) runs without error, doing this doesn't seem to make much sense.
Section 2
From some blogs that talk about creating a confusion matrix after a cross-validation procedure (see here and here), I borrowed code that, for each fold of the cross-validation, extracts the labels and predictions from the internal test set. These labels and predictions are later used to compute the confusion matrix. Assuming I store the labels and predictions in variables called actual_classes and predicted_classes, respectively, I then run:
accuracy = metrics.accuracy_score(actual_classes, predicted_classes)
cm = metrics.confusion_matrix(actual_classes, predicted_classes)
The results are exactly the same as the ones from Section 1's equivalent code. This reinforces that cross_val_predict(model, x_train, y_train, cv = kfold) works fine.
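For reference, the borrowed per-fold code looks roughly like the following sketch (the variable names and the use of NumPy indexing are my assumptions, not the blogs' exact code):

import numpy as np

# Assumes x_train and y_train are NumPy arrays and kfold is the
# StratifiedKFold instance from above.
actual_classes, predicted_classes = [], []
for train_idx, test_idx in kfold.split(x_train, y_train):
    model.fit(x_train[train_idx], y_train[train_idx])
    actual_classes.extend(y_train[test_idx])
    predicted_classes.extend(model.predict(x_train[test_idx]))
actual_classes = np.array(actual_classes)
predicted_classes = np.array(predicted_classes)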
Thus:
Does it make sense to use scikit-learn cross_val_predict() to make predictions with unseen data in k-fold cross-validation? I would say no, it doesn't, since cross_val_predict() makes predictions with the internal test sets from the cross-validation procedure. It seems that to make predictions with unseen data and compute a generalization error, we would need a way to extract one of the models from the cross-validation procedure (e.g., see this question, and the sketch below).
Does it make sense to use scikit-learn cross_val_predict() to compare models? I would say yes, it does, as long as the method is executed as shown in Section 1. The accuracy and confusion matrix could be used to make comparisons against other models.
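As a sketch of that extraction idea (assuming scikit-learn >= 0.20, where cross_validate accepts return_estimator=True):

from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, x_train, y_train, cv=kfold, return_estimator=True)
# Each entry in cv_results['estimator'] is the model fitted on one fold's training split.
fold_model = cv_results['estimator'][0]  # e.g., pick one fold's model
y_test_pred = fold_model.predict(x_test)  # predictions on truly unseen data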
Any comment is appreciated! Thanks!

Can I use BERT as a feature extractor without any finetuning on my specific data set?

I'm trying to solve a multilabel classification task of 10 classes, with a relatively balanced training set consisting of ~25K samples and an evaluation set consisting of ~5K samples.
I'm using the Hugging Face model:
model = transformers.BertForSequenceClassification.from_pretrained(...
and obtain quite nice results (ROC AUC = 0.98).
However, I'm witnessing some odd behavior that I can't make sense of.
I add the following lines of code:
for param in model.bert.parameters():
    param.requires_grad = False
while making sure that the other layers of the model are learned, that is:
[param[0] for param in model.named_parameters() if param[1].requires_grad == True]
gives
['classifier.weight', 'classifier.bias']
Training the model when configured like so yields some embarrassingly poor results (ROC AUC = 0.59).
I was working under the assumption that an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers. So, where did I go wrong?
In my experience, you are going wrong in your assumption that
an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers.
I have had similar experiences when trying to use BERT's output layer as word embeddings with little-to-no fine-tuning, which also gave very poor results; and this makes sense, since in the simplest form of output layer you effectively have only 768 * num_classes trainable connections. Compared to the millions of parameters of BERT, this gives you an almost negligible amount of control over the model's complexity. However, I also want to cautiously point to the overfitting visible when training your full model, although I'm sure you are aware of that.
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers. The one instance in which it can be helpful to freeze at least part of the model is the embedding component, depending on the model's vocabulary size (~30k for BERT-base).
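As a minimal sketch of that one exception (assuming model is the transformers.BertForSequenceClassification instance from the question):

# Freeze only BERT's input embeddings; all encoder layers and the
# classification head remain trainable.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False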
I think the following will help demystify the odd behavior I reported here earlier.
First, as it turned out, when freezing the BERT layers (and using an out-of-the-box pre-trained BERT model without any fine-tuning), the number of training epochs required for the classification layer is far greater than the number needed when all layers are allowed to learn.
For example,
Without freezing the BERT layers, I’ve reached:
ROC AUC = 0.98, train loss = 0.0988, validation loss = 0.0501 # end of epoch 1
ROC AUC = 0.99, train loss = 0.0484, validation loss = 0.0433 # end of epoch 2
Overfitting, train loss = 0.0270, validation loss = 0.0423 # end of epoch 3
Whereas, when freezing the BERT layers, I’ve reached:
ROC AUC = 0.77, train loss = 0.2509, validation loss = 0.2491 # end of epoch 10
ROC AUC = 0.89, train loss = 0.1743, validation loss = 0.1722 # end of epoch 100
ROC AUC = 0.93, train loss = 0.1452, validation loss = 0.1363 # end of epoch 1000
The (probable) conclusion that arises from these results is that working with an out-of-the-box pre-trained BERT model as a feature extractor (that is, freezing its layers) while learning only the classification layer suffers from underfitting.
This is demonstrated in two ways:
First, after running 1000 epochs, the model still hasn’t finished learning (the training loss is still higher than the validation loss).
Second, after running 1000 epochs, the loss values are still higher than the values achieved with the non-frozen version as early as the 1st epoch.
To sum it up, @dennlinger, I think I completely agree with you on this:
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers.

SMOTE oversampling applied to text classification

I am working on text classification, where I am using a Multinomial Naive Bayes classifier to predict article titles' subject categories. Both the titles and the subjects are stored in a pandas data frame as text columns. However, there are two categories which contain 50,000 records and 30,000 records respectively, so I need to oversample the data and then apply the algorithm. When I do the oversampling, it reduces the model accuracy score to 15%. Please tell me how I can improve it.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE

X_train, X_test, Y_train, Y_test = train_test_split(df['Title'], df['Subjects'], test_size=0.2, random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
sm = SMOTE(random_state=2)
# fit_resample in recent imbalanced-learn versions (fit_sample in older releases)
X_train_res, y_train_res = sm.fit_resample(X_train_tfidf, Y_train)
print("Shape after SMOTE is:", X_train_res.shape, y_train_res.shape)
nb = Pipeline([('clf', MultinomialNB())])
nb.fit(X_train_res, y_train_res)
# The test set must go through the same count + tf-idf transforms as the training set
y_pred = nb.predict(tfidf_transformer.transform(count_vect.transform(X_test)))
print(accuracy_score(Y_test, y_pred))
I expected to increase model accuracy by doing so. Model accuracy without oversampling is 62%, but after oversampling it is 15%, when it should actually be higher.
Actually, using SMOTE for balancing/oversampling classes can be problematic in text classification tasks. There are nice explanations and suggestions for alternatives here:
https://datascience.stackexchange.com/a/27758
In short, the SMOTE output may not represent "meaningful" substitutes and, due to the size of the feature space, its nearest-neighbor-based approach may yield poor results.
Some more ideas:
Instead of using accuracy, it is advisable to use F1 or a similar metric.
It is rather unlikely to help, but did you try undersampling?
For the MultinomialNB classifier, you might try setting class_prior explicitly (see the sketch below).
Finally, other methods like forests and boosting approaches might be better suited for imbalanced datasets.
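As a minimal sketch of the class_prior idea (the priors below are placeholders for a two-class illustration, not tuned recommendations; with more classes, supply one prior per class):

from sklearn.naive_bayes import MultinomialNB

# Override the priors learned from the imbalanced data with explicit ones;
# the values must sum to 1 and match the number of classes.
nb = MultinomialNB(class_prior=[0.5, 0.5])
nb.fit(X_train_tfidf, Y_train)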

How to use ModelCheckpoint() in Keras with weighted validation loss

I am training a DNN in Keras which has highly imbalanced classes, so I used class_weight in fit_generator to correct this. Now I want to save the model with the lowest weighted validation loss using the ModelCheckpoint() callback. I am trying, but I can't figure out how to achieve this. Would anyone have a simple example?
ModelCheckpoint("checkpoint.hdf5", monitor='val_loss', mode = 'min', verbose=1, save_best_only = True)
model.fit_genetor(....)
I think you are asking for this piece of code.
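As a slightly fuller sketch of the wiring together with class_weight (the generator names, weights, and epoch count are placeholders, not from the question):

checkpoint = ModelCheckpoint("checkpoint.hdf5", monitor='val_loss',
                             mode='min', verbose=1, save_best_only=True)
model.fit_generator(train_generator,
                    validation_data=val_generator,
                    epochs=20,
                    class_weight={0: 1.0, 1: 10.0},  # placeholder weights
                    callbacks=[checkpoint])

Note that class_weight in Keras affects only the training loss; the monitored val_loss is the ordinary unweighted validation loss, so checkpointing on a truly weighted validation loss would require a custom metric or callback.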

Logistic Regression in python. probability threshold

So I am approaching the classification problem with the logistic regression algorithm, and I obtain all of the predictions for the test set as class "1". The set is very imbalanced: it has over 200k inputs, and more or less 92% are from class "1". Logistic regression generally classifies an input as class "1" if P(Y=1|X) > 0.5. Since all of the observations in the test set are being classified into class 1, I thought that maybe there is a way to change this threshold and set it, for example, to 0.75, so that only observations with P(Y=1|X) > 0.75 are classified as class 1 and otherwise as class 0. How do I implement this in Python?
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score, classification_report

model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)
score = accuracy_score(y_test, model.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
cr = classification_report(y_test, model.predict(X_test))
P.S. Since all the observations from the test set are being classified as class 1, the F1 score and recall in the classification report are 0. Maybe changing the threshold will solve this problem.
One thing you might want to try is balancing the classes instead of changing the threshold. Scikit-learn supports this via the class_weight parameter. For example, you could try model = LogisticRegression(penalty='l2', class_weight='balanced', C=1). Look at the documentation for more details:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
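If you still want to change the decision threshold itself, a minimal sketch (0.75 is the example threshold from the question; y_pred_custom is my name for the result):

# Predict class "1" only when its predicted probability exceeds the custom threshold.
proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
y_pred_custom = (proba > 0.75).astype(int)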
