using sklearn.train_test_split for Imbalanced data - python-3.x

I have a very imbalanced dataset. I used sklearn.train_test_split function to extract the train dataset. Now I want to oversample the train dataset, so I used to count number of type1(my data set has 2 categories and types(type1 and tupe2) but approximately all of my train data are type1. So I cant oversample.
Previously I used to split train test datasets with my written code. In that code 0.8 of all type1 data and 0.8 of all type2 data were in the train dataset.
How I can use this method with train_test_split function or other spliting methods in sklearn?
*I should just use sklearn or my own written methods.

You're looking for stratification. Why?
There's a parameter stratify in method train_test_split to which you can give the labels list e.g. :
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
test_size=0.2)
There's also StratifiedShuffleSplit.

It seems like we both had similar issues here. Unfortunately, imbalanced-learn isn't always what you need and scikit does not offer the functionality you want. You will want to implement your own code.
This is what I came up for my application. Note that I have not had extensive time to debug it but I believe it works from the testing I have done. Hope it helps:
def equal_sampler(classes, data, target, test_frac):
# Find the least frequent class and its fraction of the total
_, count = np.unique(target, return_counts=True)
fraction_of_total = min(count) / len(target)
# split further into train and test
train_frac = (1-test_frac)*fraction_of_total
test_frac = test_frac*fraction_of_total
# initialize index arrays and find length of train and test
train=[]
train_len = int(train_frac * data.shape[0])
test=[]
test_len = int(test_frac* data.shape[0])
# add values to train, drop them from the index and proceed to add to test
for i in classes:
indeces = list(target[target ==i].index.copy())
train_temp = np.random.choice(indeces, train_len, replace=False)
for val in train_temp:
train.append(val)
indeces.remove(val)
test_temp = np.random.choice(indeces, test_len, replace=False)
for val in test_temp:
test.append(val)
# X_train, y_train, X_test, y_test
return data.loc[train], target[train], data.loc[test], target[test]
For the input, classes expects a list of possible values, data expects the dataframe columns used for prediction, target expects the target column.
Take care that the algorithm may not be extremely efficient, due to the triple for-loop(list.remove takes linear time). Despite that, it should be reasonably fast.

You may also look into stratified shuffle split as follows:
# We use a utility to generate artificial classification data.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
X, y = make_classification(n_samples=100, n_informative=10, n_classes=2)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Related

Different R-squared scores for different times

I just learned about cross-validation and when I give in different arguments, there are different results.
This is the code for building the Regression Model and the R-squared output was about .5 :
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
boston = load_boston()
X = boston.data
y = boston['target']
X_rooms = X[:,5]
X_train, X_test, y_train, y_test = train_test_split(X_rooms, y)
reg = LinearRegression()
reg.fit(X_train.reshape(-1,1), y_train)
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1,1)
plt.scatter(X_test, y_test)
plt.plot(prediction_space, reg.predict(prediction_space), color = 'black')
reg.score(X_test.reshape(-1,1), y_test)
Now when I give the cross-validation for X_train, X_test, and X(respectively), it shows different R-squared values.
Here's the X_test and y_test arguments:
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_test.reshape(-1,1), y_test, cv = 8)
cv
The result:
array([ 0.42082715, 0.6507651 , -3.35208835, 0.6959869 , 0.7770039 ,
0.59771158, 0.53494622, -0.03308137])
Now when I use the arguments, X_train and y_train, different results are outputted.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_train.reshape(-1,1), y_train, cv = 8)
cv
The result:
array([0.46500321, 0.27860944, 0.02537985, 0.72248968, 0.3166983 ,
0.51262191, 0.53049663, 0.60138472])
Now, when I input different arguments again; this time X(which in my case is X_rooms) and y, I yet again get different R-squared values.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_rooms.reshape(-1,1), y, cv = 8)
cv
The output:
array([ 0.61748403, 0.79811218, 0.61559222, 0.6475456 , 0.61468198,
-0.7458466 , -3.71140488, -1.17174927])
Which one should I use?
I know this is a long question so Thanks!!
Train set should be distinctly use for training your model, while test set is for final evaluation. But unfortunately, you need to test your model's score on some set before checking it on final result (test set): for example when you try to tune some hyper-parameters. There are some other reasons to use cv, it's just one of them.
Usually the process is:
Split train and test
Train model use cv to check stability, including hyper-tune params (which is irrelevant in your case)
Assess model score on test set.
scikit-learn's cross_val_score receives an object (before training!) and data. It trains each time model on different section of data, and then returns the score. It's like having a lot of "train-test" checks.
Therefore, you should:
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
cv = cross_val_score(reg, X_train.reshape(-1,1), y_train, cv = 8)
solely on train set. Test set should be used for other purposes.
What you get is a list of accuracy score. You can see if your model is stable (does accuracy is in same range among all folds?) or general performance of model (avg score)

How to test unseen test data with cross validation and predict labels?

1.The CSV that contains data(ie. text description) along with categorized labels
df = pd.read_csv('./output/csv_sanitized_16_.csv', dtype=str)
X = df['description_plus']
y = df['category_id']
2.This CSV contains unseen data(ie. text description) for which labels need to be predicted
df_2 = pd.read_csv('./output/csv_sanitized_2.csv', dtype=str)
X2 = df_2['description_plus']
Cross validation function that operates on the training data(item #1) above.
def cross_val():
cv = KFold(n_splits=20)
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
stop_words='english')
X_train = vectorizer.fit_transform(X)
clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
scores = cross_val_score(clf, X_train, y, cv=cv)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
cross_val()
I need to know how to pass the unseen data(item #2) to the cross validation function and how to predict the labels?
Using scores = cross_val_score(clf, X_train, y, cv=cv) you can only get the cross-validated scores of the model. cross_val_score will internally split the data into training and testing based on the cv parameter.
So the values that you get are the cross-validated accuracy of the SVC.
To get the score on the unseen data, you can first fit the model e.g.
clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
clf.fit(X_train, y) # the model is trained now
and then do clf.score(X_unseen,y)
The last will return the accuracy of the model on the unseen data.
EDIT: The best way to do what you want is the following using a GridSearch to first find the best model using the training data and then evaluate the best model using the unseen (test) data:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# load some data
iris = datasets.load_iris()
X, y = iris.data, iris.target
#split data to training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# hyperparameter tunig of the SVC model
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
# fit the GridSearch using the TRAINING data
grid_searcher = GridSearchCV(svc, parameters)
grid_searcher.fit(X_train, y_train)
#recover the best estimator (best parameters for the SVC, based on the GridSearch)
best_SVC_model = grid_searcher.best_estimator_
# Now, check how this best model behaves on the test set
cv_scores_on_unseen = cross_val_score(best_SVC_model, X_test, y_test, cv=5)
print(cv_scores_on_unseen.mean())

ML Model not predicting properly

I am trying to create an ML model (regression) using various techniques like SMR, Logistic Regression, and others. With all the techniques, I'm not able to get efficiency more than 35%. Here's what I'm doing:
X_data = [X_data_distance]
X_data = np.vstack(X_data).astype(np.float64)
X_data = X_data.T
y_data = X_data_orders
#print(X_data.shape)
#print(y_data.shape)
#(10000, 1)
#(10000,)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.33, random_state=42)
svr_rbf = SVC(kernel= 'rbf', C= 1.0)
svr_rbf.fit(X_train, y_train)
plt.plot(X_data_distance, svr_rbf.predict(X_data), color= 'red', label= 'RBF model')
For the plot, I'm getting the following:
I have tried various parameter tuning, changing the parameter C, gamma even tried different kernels, but nothing changes the accuracy. Even tried SVR, Logistic regression instead of SVC, but nothing helps. I tried different scaling for training input data like StandardScalar() and scale().
I used this as a reference
What should I do?
As a rule of thumb, we usually follow this convention:
For little number of features, go with Logistic Regression.
For a lot of features but not a lot of data, go with SVM.
For a lot of features and a lot of data, go with Neural Network.
Because your dataset is a 10K cases, it'd be better to use Logistic Regression because SVM will take forever to finish!.
Nevertheless, because your dataset contains a lot of classes, there is a chance of classes imbalance in your implementation. Thus I tried to workaround this problem via using the StratifiedKFold instead of train_test_split which doesn't guarantee balanced classes in the splits.
Moreover, I used GridSearchCV with StratifiedKFold to perform Cross-Validation in order to tune the parameters and try all different optimizers!
So the full implementation is as follows:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
import numpy as np
def getDataset(path, x_attr, y_attr):
"""
Extract dataset from CSV file
:param path: location of csv file
:param x_attr: list of Features Names
:param y_attr: Y header name in CSV file
:return: tuple, (X, Y)
"""
df = pd.read_csv(path)
X = X = np.array(df[x_attr]).reshape(len(df), len(x_attr))
Y = np.array(df[y_attr])
return X, Y
def stratifiedSplit(X, Y):
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, Y))
X_train, X_test = X[train_index], X[test_index]
Y_train, Y_test = Y[train_index], Y[test_index]
return X_train, X_test, Y_train, Y_test
def run(X_data, Y_data):
X_train, X_test, Y_train, Y_test = stratifiedSplit(X_data, Y_data)
param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2'],
'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
model = LogisticRegression(random_state=0)
clf = GridSearchCV(model, param_grid, cv=StratifiedKFold(n_splits=10))
clf.fit(X_train, Y_train)
print(accuracy_score(Y_train, clf.best_estimator_.predict(X_train)))
print(accuracy_score(Y_test, clf.best_estimator_.predict(X_test)))
X_data, Y_data = getDataset("data - Sheet1.csv", ['distance'], 'orders')
run(X_data, Y_data)
Despite all the attempts with all different algorithms, the accuracy didn't exceed 36%!!.
Why is that?
If you want to make a person recognize/classify another person by their T-shirt color, you cannot say: hey if it's red that means he's John and if it's red it's Peter but if it's red it's Aisling!! He would say "really, what the hack is the difference"?!!.
And that's exactly what is in your dataset!
Simply, run print(len(np.unique(X_data))) and print(len(np.unique(Y_data))) and you'll find that the numbers are so weird, in a nutshell you have:
Number of Cases: 10000 !!
Number of Classes: 118 !!
Number of Unique Inputs (i.e. Features): 66 !!
All classes are sharing hell a lot of information which make it impressive to have even up to 36% accuracy!
In other words, you have no informative features which lead to a lack in the uniqueness of each class model!
What to do?
I believe you are not allowed to remove some classes, so the only two solutions you have are:
Either live with this very valid result.
Or add more informative feature(s).
Update
Having you provided same dataset but with more features (i.e. complete set of features), the situation now is different.
I recommend you do the following:
Pre-process your dataset (i.e. prepare it by imputing missing values or deleting rows containing missing values, and converting dates to some unique values (example) ...etc).
Check what features are most important to the Orders Classes, you can achieve that by using of Forests of Trees to evaluate the importance of features. Here is a complete and simple example of how to do that in Scikit-Learn.
Create a new version of the dataset but this time hold Orders as the Y response, and the above-found features as the X variables.
Follow the same GrdiSearchCV and StratifiedKFold procedure that I showed you in the implementation above.
Hint
As per mentioned by Vivek Kumar in the comment below, stratify parameter has been added in Scikit-learn update to the train_test_split function.
It works by passing the array-like ground truth, so you don't need my workaround in the function stratifiedSplit(X, Y) above.

cross validation and text classification for imbalanced data

I am new to NLP and I am trying to build a text classifier but my data is currently imbalanced.The highest category having as much as 280 entries while the lowest as much as 30.
I am trying to use cross validation technique for the current data, but after looking for days now i am unable to implement it.It looks pretty straightforward but I am still unable to implement it. Here is my code
y = resample.Subsystem
X = resample['new description']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
#SVM
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english')),('tfidf', TfidfTransformer()),('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, n_iter=5, random_state=42)),])
text_clf_svm.fit(X_train, y_train)
predicted_svm = text_clf_svm.predict(X_test)
print('The best accuracy is : ',np.mean(predicted_svm == y_test))
I have done some gridsearch and Stemmer further but right now I would work on cross validation on this code.I have cleaned the data pretty well but i am stil getting an accuracy of 60%
Any help would be appreciated
Try to do oversampling or under sampling. As the data is highly imbalanced, There is more bias towards the class with more data points. After the over/under sampling the bias will be very less and accuracy will up.
Else instead of SVM you can use MLP. It gives good results even with unbalanced data.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, random_state=None)
# X is the feature set and y is the target
from sklearn.model_selection import RepeatedKFold
kf = RepeatedKFold(n_splits=20, n_repeats=10, random_state=None)
for train_index, test_index in kf.split(X):
#print("Train:", train_index, "Validation:",test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

How to run only one fold of cross validation in sklearn?

I have he following code to run a 10-fold cross validation in SkLearn:
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
scores = model_selection.cross_val_score(MyEstimator(), x_data, y_data, cv=cv, scoring='mean_squared_error') * -1
For debugging purposes, while I am trying to make MyEstimator work, I would like to run only one fold of this cross-validation, instead of all 10. Is there an easy way to keep this code but just say to run the first fold and then exit?
I would still like that data is split into 10 parts, but that only one combination of that 10 parts is fitted and scored, instead of 10 combinations.
No, not with cross_val_score I suppose. You can set n_splits to minimum value of 2, but still that will be 50:50 split of train, test which you may not want.
If you want maintain a 90:10 ration and test other parts of code like MyEstimator(), then you can use a workaround.
You can use KFold.split() to get the first set of train and test indices and then break the loop after first iteration.
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(x_data):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = x_data[train_index], x_data[test_index]
y_train, y_test = y_data[train_index], y_data[test_index]
break
Now use this X_train, y_train to train the estimator and X_test, y_test to score it.
Instead of :
scores = model_selection.cross_val_score(MyEstimator(),
x_data, y_data,
cv=cv,
scoring='mean_squared_error')
Your code becomes:
myEstimator_fitted = MyEstimator().fit(X_train, y_train)
y_pred = myEstimator_fitted.predict(X_test)
from sklearn.metrics import mean_squared_error
# I am appending to a scores list object, because that will be output of cross_val_score.
scores = []
scores.append(mean_squared_error(y_test, y_pred))
Rest assured, cross_val_score will be doing this only internally, just some enhancements for parallel processing.

Resources