SMOTE over sampling applied on text classification - python-3.x

I am working on text classification, where I am using Multinominal Naive Bayes Classifier to predict article titles into their respective subject categories. Both of these are stored in a pandas data frame and are text columns. However they're are two categories which contain 50,000 records and 30,000 records respectively. Hence I need to do oversampling of the data and then apply the algorithm. When I do oversampling it reduces the model accuracy score and give me 15%. Please tell me how I can improve it.
X_train, X_test, Y_train, Y_test=train_test_split(df['Title'],df['Subjects'], test_size=0.2,random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train_tfidf, Y_train)
print("Shape after smote is:",X_train_res.shape,y_train_res.shape)
nb = Pipeline([('clf', MultinomialNB())])
nb.fit(X_train_res, y_train_res)
y_pred = nb.predict(count_vect.transform(X_test))
print(accuracy_score(Y_test,y_pred))
I expect to increase model accuracy by doing so. Model accuracy without oversampling is 62% and after oversampling is 15%, when it should actually be higher.

Actually, using SMOTE for balancing/oversampling classes can be problematic in text classification tasks. There are nice explanations and suggestions for alternatives here:
https://datascience.stackexchange.com/a/27758
In short, the SMOTE output may not represent "meaningful" substitutes and due to the size of the feature space its nearest-neighbor based approach may yield poor results.
Some more ideas:
Instead of using accuracy, it is advisable to use F1 or similar.
Rather unlikely to help but did you try undersampling?
For the MultinomialNB classifier you might try setting class_prior explicitly.
Finally, other methods like Forests and Boosting approaches might be better suited for imbalanced datasets.

Related

Does it make sense to use scikit-learn cross_val_predict() to (i)make predictions with unseen data in k-fold cross-validation and (ii)compare models?

I'm training and evaluating a logistic regression and a XGBoost classifier.
With the XGBoost classifier, a training/validation/test split of the data and the subsequent training and validation shows the model is overfitting the training data. So, I'm working with k-fold cross-validation to reduce overfitting.
To work with k-fold cross-validation, I'm splitting my data into training and test sets and performing the k-fold cross-validation on the training set. The code looks something like the following:
model = XGBClassifier()
kfold = StratifiedKFold(n_splits = 10)
results = cross_val_score(model, x_train, y_train, cv = kfold)
The code works. Now, I've read several forums and blogs on how to make predictions after a k-fold cross-validation, but after these readings, I'm still not sure about the proper way of doing the predictions.
It would seem that using the cross_val_predict() method from sklearn.model_selection and using the test set is OK. The code would look something like the following:
y_pred = cross_val_predict(model, x_test, y_test, cv = kfold)
The code works, but the issue is whether this makes sense since I've seen more complicated ways of doing so and where it doesn't seem clear whether the training or the test set should be used for the predictions.
And if this makes sense, computing the accuracy score and the confusion matrix would be as simple as running something like the following:
accuracy = metrics.accuracy_score(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
These two would help compare the logistic regression and the XGBoost classifier. Does this way of making predictions and evaluating models make sense?
Any help is appreciated! Thanks!
I want to answer this question I posted myself by summarizing things I have read and tried.
First, I want to clarify that the idea behind splitting my data into training/test sets and performing the k-fold cross-validation on the training set is to reserve the test set for providing a generalization error in much the same way we split data into training/validation/test sets and use the test set for providing a generalization error. For the sake of clarity, let me split the discussion into 2 sections.
Section 1
Now, reading more stuff, it's clearer to me cross_val_predict() returns the predictions that were obtained during the cross-validation when the elements were in a test set (see section 3.1.1.2 in this scikit-learn cross-validation doc). This test set refers to one of the test sets the cross-validation procedure internally creates (cross-validation creates a test set in each fold). Thus:
y_pred = cross_val_predict(model, x_train, y_train, cv = kfold)
returns the predictions from the cross-validation internal test sets. It then seems safe to obtain the accuracy and confusion matrix with:
accuracy = metrics.accuracy_score(y_train, y_pred)
cm = metrics.confusion_matrix(y_train, y_pred)
While cross_val_predict(model, x_test, y_test, cv = kfold) runs, it seems doing this doesn't make much sense.
Section 2
From some blogs that talk about creating a confusion matrix after a cross-validation procedure (see here and here), I borrowed code that, for each fold of the cross-validation, extracts the labels and predictions from the internal test set. These labels and predictions are later used to compute the confusion matrix. Assuming I store the labels and predictions in variables called actual_classes and predicted_classes, respectively, I then run:
accuracy = metrics.accuracy_score(actual_classes, predicted_classes)
cm = metrics.confusion_matrix(actual_classes, predicted_classes)
The results are exactly the same as the ones from Section 1's equivalent code. This reinforces that cross_val_predict(model, x_train, y_train, cv = kfold) works fine.
Thus:
Does it make sense to use scikit-learn cross_val_predict() to make
predictions with unseen data in k-fold cross-validation? I would say
No, it doesn't since cross_val_predict() makes predictions with
the internal test sets from the cross-validation procedure. It
seems that to make predictions with unseen data and compute a
generalization error we would need a way to extract one of the
models from the cross-validation procedure (e.g., see this
question)
Does it make sense to use scikit-learn cross_val_predict() to
compare models? I would say Yes, it does as long as the method is
executed as shown in Section 1. The accuracy and confusion matrix
could be used to make comparisons against other models.
Any comment is appreciated! Thanks!

What steps should I take next to improve my accuracy? Can data be the problem?

I built various ML models using sklearn for a binary classification problem. The data-set is provided to me by my professor for this comparative study.
my jupyter notebook and dataset can be found here
As I am getting very low accuracy, I fear that I must be doing something wrong while building the model. So I tested my decision tree on the inbuilt data-set in sklearn (breast cancer data-set) which is very similar to my data-set as both are binary classifications. Here I get an mean accuracy of 95 %. So I think right now that the problem might be my data-set. Can I get some help on how do I pre-process my data or any other steps that I might look into to improve accuracy.
Encode labels
Categorical data are variables that contain label values rather than numeric values.The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group etc. We will use Label Encoder to label the categorical data. Label Encoder is the part of SciKit Learn library in Python and used to convert categorical data, or text data, into numbers, which our predictive models can better understand.
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Eucledian distance between two data points in their computations. We need to bring all features to the same level of magnitudes. This can be achieved by scaling. This means that you’re transforming your data so that it fits within a specific scale, like 0–100 or 0–1. We will use StandardScaler method from SciKit-Learn library.
#Feature Scalingfrom sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Choosing Right model
You kight also want to vhoose the appropriate model. You can't just use neural nets or so for all problems it's the no free luch theorem. For this you could use K-fold cross validation, AIC and BIC

does training on total dataset improves confidence scores

I'm using SVC(kernel="linear", probability=True) in multiclass classification. when I'm using 2/3rd of my data for training purpose, I'm getting ~72%. And when I tried to predict in production, Confidence scores I'm getting are very less. Does training on the total dataset helps to improve confidence scores?
Does training on the total dataset helps to improve confidence scores?
It might. In general, the more data the better. However evaluating performance should be done on data that the model has not seen before. One way to do this is to set aside a part of the data, a test set, as you have done. Another approach is to use cross-validation, see below.
And when I tried to predict in production, Confidence scores I'm getting are very less.
This means that your model does not generalize well. In other words when presented with data it has not seen before the model starts to make more or less random predictions.
To get a better sense of how well your model generalizes you may want to use cross-validation:
from sklearn.model_selection import cross_val_score
clf = SVC()
scores = cross_val_score(clf, X, Y)
This will train and evaluate your classifier on the full dataset using folds of the full data. A fold For each split the classifier is trained and validation on an exclusive subset of the data. For each split the scores result contains the validation score (for SVC, the accuracy). If you need more control over which metrics to evaluate, use the cross_validation function.
to predict in production
In order to improve your model's performance, there are several methods to consider:
Use more training data
Use an ensemble model to reduce prediction variance
Use a different model (algorithm)

Imbalanced dataset using MLP classifier in python

I am dealing with imbalanced dataset and I try to make a predictive model using MLP classifier. Unfortunately the algorithm classifies all the observations from test set to class "1" and hence the f1 score and recall values in classification report are 0. Does anyone know how to deal with it?
model= MLPClassifier(solver='lbfgs', activation='tanh')
model.fit(X_train, y_train)
score=accuracy_score(y_test, model.predict(X_test), )
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:,1])
roc=roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
cr=classification_report(y_test, model.predict(X_test))
There are few techniques to handle the imbalanced dataset. A fully dedicated python library "imbalanced-learn" is available here. But one should be cautious about which technique should be used in a specific case.
Few interesting examples are also available at https://svds.com/learning-imbalanced-classes/

Scikit SVM gives very poor accuracy for STL-10 dataset

I am using Scikit-learn SVM for training my model for STL-10 dataset which contains 5000 training images (10 pre-defined folds). So I have 5000*96*96*3 size dataset for training and test purposes. I used following code to train it and measure the accuracy for the test set. (80% 20%). Final result was 0.323 accuracy. How can I increase the accuracy for SVM.
This is STL10 dataset
def train_and_evaluate(clf, train_x, train_y):
clf.fit(train_x, train_y)
#make 2D array as we can apply only 2d to fit() function
nsamples, nx, ny, nz = images.shape
reshaped_train_dataset = images.reshape((nsamples, nx * ny * nz))
X_train, X_test, Y_train, Y_test = train_test_split(reshaped_train_dataset, read_labels(LABEL_PATH), test_size=0.20, random_state=33)
train_and_evaluate(my_svc, X_train, Y_train)
print(metrics.accuracy_score(Y_test, clf2.predict(X_test)))
So it seems you are using raw SVM directly on the images. That is usually not a good idea (it is rather bad actually).
I will describe the classic image-classification pipeline popular in the last decades! Keep in mind, that the highest performing approaches right now might use Deep Neural Networks to combine some of these steps (a very different approach; a lot of research in the last years!)
First step:
Preprocessing is needed!
Normalize mean and variance (i would not expect your dataset to be already normalized)
Optional: histogram-equalization
Second step:
Feature-extraction -> you should learn some features from these images. There are a lot of approaches including
(Kernel-)PCA
(Kernel-)LDA
Dictionary-learning
Matrix-factorization
Local binary patterns
... (just test with LDA initially)
Third:
SVM for classification
again there might be a Normalization-step needed before this and as mentioned in the comments by #David Batista: there might be some parameter-tuning needed (especially for Kernel-SVM)
It is also not clear, if using color-information is wise here. For more simple approaches i expect black-and-white images to be superior (you are losing information but tuning your pipeline is more robust; high-performance approaches will of course use color-information).
See here for some random tutorial describing a similar problem. While i don't know if it's good work, you could immediatly recognize the processing-pipeline mentioned above (preprocessing, feature-extraction, classifier-learning)!
Edit:
Why preprocessing?: some algorithms assume centered samples with unit-variance, therefore normalization is needed. This is (at least) very important for PCA, LDA and SVM's.

Resources