Nested cross validation with StratifiedShuffleSplit in sklearn - scikit-learn

I am working on a binary classification problem and would like to perform nested cross validation to assess the classification error. The reason I'm doing nested CV is the small sample size (N_0 = 20, N_1 = 10), where N_0 and N_1 are the numbers of instances in class 0 and class 1 respectively.
My code is quite simple:
>> pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', LogisticRegression(penalty='l1'))])
>> parameters = {'clf__C': logspace(-4,1,50)}
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=5)
>> cross_val_score(grid_search, X, y, cv=5)
So far, so good. But if I want to change the CV scheme (from random splitting to StratifiedShuffleSplit) in both the outer and inner CV loops, I face a problem: how can I pass the class vector y, as required by the StratifiedShuffleSplit function?
Naively:
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=StratifiedShuffleSplit(y_inner_loop, 5, test_size=0.5, random_state=0))
>> cross_val_score(grid_search, X, y, cv=StratifiedShuffleSplit(y, 5, test_size=0.5, random_state=0))
So, the problem is: how do I specify y_inner_loop?
** My data set is slightly imbalanced (20/10) and I would like to keep this class ratio when training and assessing the model.

So far, I have resolved this problem, which might be of interest to some ML novices. In scikit-learn 0.18, the cross-validation utilities moved to the sklearn.model_selection module and their API changed slightly. To make a long story short:
>> from sklearn.pipeline import Pipeline
>> from sklearn.preprocessing import StandardScaler
>> from sklearn.linear_model import LogisticRegression
>> from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score
>> from numpy import logspace
>> sss_outer = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)
>> sss_inner = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=16)
>> pipe_logistic = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression(penalty='l1'))])
>> parameters = {'clf__C': logspace(-4, 1, 50)}
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=sss_inner)
>> cross_val_score(grid_search, X, y, cv=sss_outer)
UPD: in the new API we no longer pass the target vector explicitly ("y", which was my problem initially) to the splitter's constructor, only the desired number of splits; y is supplied later when the splits are generated, which cross_val_score and GridSearchCV do internally.
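To illustrate the point, here is a minimal sketch (with made-up toy data matching the 20/10 class counts from the question) of how the new splitters receive y: the constructor only takes the number of splits, and the class vector is passed to the split method, which is exactly what cross_val_score and GridSearchCV do under the hood.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# toy imbalanced data: 20 samples of class 0, 10 of class 1
X = np.random.randn(30, 4)
y = np.array([0] * 20 + [1] * 10)

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)

# y goes to split(), not to the constructor; each split preserves the 2:1 class ratio
for train_idx, test_idx in sss.split(X, y):
    print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))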

Related

how to use an explicit validation set with predefined split fold?

I have explicit train, test and validation sets as 2d arrays:
X_train.shape
(1400, 38785)
X_val.shape
(200, 38785)
X_test.shape
(400, 38785)
I am tuning the alpha parameter and need advice about how I can use the predefined validation set in it:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, PredefinedSplit
nb = MultinomialNB()
nb.fit(X_train, y_train)
params = {'alpha': [0.1, 1, 3, 5, 10,12,14]}
# how to use on my validation set?
# ps = PredefinedSplit(test_fold=?)
gs = GridSearchCV(nb, param_grid=params, cv = ps, return_train_score=True, scoring='f1')
gs.fit(X_train, y_train)
My results so far are as follows.
# on my validation set, alpha = 5
gs.fit(X_val, y_val)
print('Grid best parameter', gs.best_params_)
Grid best parameter: {'alpha': 5}
# on my training set, alpha = 10
Grid best parameter: {'alpha': 10}
I have read the following questions and documentation, yet I am not sure how to use PredefinedSplit() in my case. Thank you.
Order between using validation, training and test sets
https://scikit-learn.org/stable/modules/cross_validation.html#predefined-fold-splits-validation-sets
You can achieve your desired outcome by merging X_train and X_val and passing PredefinedSplit an array of fold labels, with -1 indicating training data (never used for testing) and 0 indicating validation data. I.e.,
X = np.concatenate((X_train, X_val))
y = np.concatenate((y_train, y_val))
ps = PredefinedSplit(test_fold=np.concatenate((-np.ones(len(X_train)), np.zeros(len(X_val)))))
gs = GridSearchCV(nb, param_grid=params, cv = ps, return_train_score=True, scoring='f1')
gs.fit(X, y) # not X_train, y_train
However, unless there is a very good reason for holding out a separate validation set, you will likely have less overfitting if you use k-fold cross validation for your hyperparameter tuning rather than a dedicated validation set, as sketched below.
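For example, a minimal sketch of that k-fold alternative (nb, params, and the merged X, y are taken from the snippets above; the choice of 5 folds is an assumption):
from sklearn.model_selection import GridSearchCV

# let GridSearchCV create stratified 5-fold splits internally instead of using a fixed validation set
gs_cv = GridSearchCV(nb, param_grid=params, cv=5, return_train_score=True, scoring='f1')
gs_cv.fit(X, y)  # X, y are the merged train + validation data from above
print('Grid best parameter:', gs_cv.best_params_)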

Kfold cross validation in python

What I'm trying to do:
Get the k-fold cross-validated scores of an SVM. The data has all numerical independent variables and a categorical dependent variable. I'm using Python 3, sklearn and feature-engine.
My understanding of the matter:
The independent variables have NA values, all of them below 5% of the total data points, so I imputed them using the median values from the train set, as the variables are not normally distributed. I also scaled the values of the train and test sets using the values from the train set. My train-test split is 80-20.
I understand that it is good practice to scale and impute data using only the train set, as this helps avoid over-fitting and data leakage.
When it comes to k-fold cross validation, however, the train and test sets change with each fold.
Question:
Is there a way to ensure that I can re-impute and re-scale the train and test sets based on the train set of each fold?
Any help is appreciated, thank you!
Train-test split using a random seed. The same random seed is used in the k-fold cross validation.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 3)
NA value imputation;
from feature_engine import missing_data_imputers as mdi
imputer = mdi.MeanMedianImputer(imputation_method = 'median')
imputer.fit(X_train)
X_train = imputer.transform(X_train)
Variable transformation;
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)
Below is the SVM;
def svm1(gam, C):
    clf1 = svm.SVC(gamma=gam, C=C)
    clf1.fit(X_train_trans, y_train)
    print('The Trainset Score is {}.'.format(clf1.score(X_train_trans, y_train)))
    print('The Testset Score is {}.'.format(clf1.score(X_test_trans, y_test)))
    print('')
    y_pred1 = clf1.predict(X_test_trans)
    print('The confusion matrix is:\n{}'.format(metrics.confusion_matrix(y_test, y_pred1)))

interactive(svm1, gam=G1, C=cc1)
I then merge the train and test set, to get back a transformed dataset;
frames3 = [X_test_trans, X_train_trans ]
X_Final = pd.concat(frames3)
Now I fit X_Final, which is the concatenated train and test set, to get the k-fold cross-validated score.
kfold = KFold(n_splits = 10, random_state = 3)
model = svm.SVC(gamma=0.23, C=3.20)
results = cross_val_score(model, PCA_X_Final,y_Final, cv = kfold)
print(results)
print('Accuracy = {}%, Standard Deviation = {}%'.format(round(results.mean(), 4), round(results.std(), 2)))
I would like to know how I can re-scale and re-impute each fold, so that the variables are re-scaled and the NA values re-imputed in each fold using that fold's train set, to avoid overfitting / data leakage.
To impute and scale the data with parameters derived from each fold in the CV, you first need to set up the engineering steps in a pipeline, and then run the CV over the entire pipeline. For example, something like this:
Set up the engineering pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from feature_engine import missing_data_imputers as mdi

my_pipe = Pipeline([
    # missing data imputation
    ('imputer_num',
     mdi.MeanMedianImputer(imputation_method='mean', variables=['varA', 'varB'])),
    # scaler
    ('scaler', StandardScaler()),
    # Gradient Boosted machine (or your SVM instead)
    ('gbm', GradientBoostingClassifier(random_state=0))
])
Then the CV:
from sklearn.model_selection import GridSearchCV

param_grid = {
    # try different gradient boosted tree model parameters
    'gbm__max_depth': [None, 1, 3],
}

# now we set up the grid search with cross-validation
grid_search = GridSearchCV(my_pipe, param_grid,
                           cv=5, n_jobs=-1, scoring='roc_auc')
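Fitting the grid search then re-fits the imputer, the scaler and the model on each fold's training portion only, which is exactly the per-fold re-impute / re-scale behaviour asked about. A minimal sketch, assuming X_train and y_train are the raw (un-imputed, un-scaled) training data from the question:
grid_search.fit(X_train, y_train)  # each CV fold re-fits the imputer and scaler on that fold's train split
print(grid_search.best_params_)
print(grid_search.best_score_)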
More details in this notebook.

ML Model not predicting properly

I am trying to create an ML model (regression) using various techniques like SMR, Logistic Regression, and others. With all of these techniques, I'm not able to get more than 35% accuracy. Here's what I'm doing:
X_data = [X_data_distance]
X_data = np.vstack(X_data).astype(np.float64)
X_data = X_data.T
y_data = X_data_orders
#print(X_data.shape)
#print(y_data.shape)
#(10000, 1)
#(10000,)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.33, random_state=42)
svr_rbf = SVC(kernel= 'rbf', C= 1.0)
svr_rbf.fit(X_train, y_train)
plt.plot(X_data_distance, svr_rbf.predict(X_data), color= 'red', label= 'RBF model')
For the plot, I'm getting the following:
I have tried various parameter tunings, changing the parameter C and gamma, and even tried different kernels, but nothing changes the accuracy. I even tried SVR and Logistic Regression instead of SVC, but nothing helps. I tried different scalings for the training input data, like StandardScaler() and scale().
I used this as a reference
What should I do?
As a rule of thumb, we usually follow this convention:
For little number of features, go with Logistic Regression.
For a lot of features but not a lot of data, go with SVM.
For a lot of features and a lot of data, go with Neural Network.
Because your dataset has 10K cases, it's better to go with Logistic Regression, because SVM will take forever to finish!
Nevertheless, because your dataset contains a lot of classes, there is a chance of class imbalance in your splits. I therefore worked around this problem by using StratifiedShuffleSplit (and StratifiedKFold for the CV) instead of train_test_split, which doesn't guarantee balanced classes in the splits.
Moreover, I used GridSearchCV with StratifiedKFold to perform cross-validation in order to tune the parameters and try all the different solvers!
So the full implementation is as follows:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
import numpy as np
def getDataset(path, x_attr, y_attr):
    """
    Extract dataset from CSV file
    :param path: location of csv file
    :param x_attr: list of Features Names
    :param y_attr: Y header name in CSV file
    :return: tuple, (X, Y)
    """
    df = pd.read_csv(path)
    X = np.array(df[x_attr]).reshape(len(df), len(x_attr))
    Y = np.array(df[y_attr])
    return X, Y

def stratifiedSplit(X, Y):
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_index, test_index = next(sss.split(X, Y))
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    return X_train, X_test, Y_train, Y_test

def run(X_data, Y_data):
    X_train, X_test, Y_train, Y_test = stratifiedSplit(X_data, Y_data)
    param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2'],
                  'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
    model = LogisticRegression(random_state=0)
    clf = GridSearchCV(model, param_grid, cv=StratifiedKFold(n_splits=10))
    clf.fit(X_train, Y_train)
    print(accuracy_score(Y_train, clf.best_estimator_.predict(X_train)))
    print(accuracy_score(Y_test, clf.best_estimator_.predict(X_test)))

X_data, Y_data = getDataset("data - Sheet1.csv", ['distance'], 'orders')
run(X_data, Y_data)
Despite all the attempts with all the different algorithms, the accuracy didn't exceed 36%!
Why is that?
If you want a person to recognize/classify another person by their T-shirt color, you cannot say: hey, if it's red that means he's John, and if it's red it's Peter, but if it's red it's Aisling!! He would say, "really, what the heck is the difference?!"
And that's exactly what is in your dataset!
Simply run print(len(np.unique(X_data))) and print(len(np.unique(Y_data))) and you'll find that the numbers are striking; in a nutshell you have:
Number of Cases: 10000 !!
Number of Classes: 118 !!
Number of Unique Inputs (i.e. Features): 66 !!
All classes share a huge amount of information, which makes it impressive to reach even 36% accuracy!
In other words, you have no informative features, which leads to a lack of uniqueness in each class's model!
What to do?
I believe you are not allowed to remove some classes, so the only two solutions you have are:
Either live with this very valid result.
Or add more informative feature(s).
Update
Now that you have provided the same dataset but with more features (i.e. the complete set of features), the situation is different.
I recommend you do the following:
Pre-process your dataset (i.e. prepare it by imputing missing values or deleting rows containing missing values, converting dates to some unique values (example), etc.).
Check which features are most important to the Orders classes; you can achieve that by using forests of trees to evaluate the importance of features (see the sketch after this list). Here is a complete and simple example of how to do that in Scikit-Learn.
Create a new version of the dataset, but this time hold Orders as the Y response and the above-found features as the X variables.
Follow the same GridSearchCV and StratifiedKFold procedure that I showed you in the implementation above.
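A minimal, hedged sketch of the feature-importance step (the dataframe df, its column names, and the forest settings are illustrative assumptions, not part of the original answer):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# df is assumed to be the full dataset with an 'orders' target column; adapt the names to your data
feature_cols = [c for c in df.columns if c != 'orders']
forest = RandomForestClassifier(n_estimators=250, random_state=0)
forest.fit(df[feature_cols], df['orders'])

# rank features by the forest's impurity-based importance
importances = pd.Series(forest.feature_importances_, index=feature_cols).sort_values(ascending=False)
print(importances)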
Hint
As mentioned by Vivek Kumar in the comments, a stratify parameter has been added to the train_test_split function in a newer scikit-learn release.
It works by passing the array-like ground truth, so you don't need my workaround in the stratifiedSplit(X, Y) function above.
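For example, a minimal sketch of the stratified split without the workaround (using the X_data and Y_data loaded above; the test size and seed are kept from the earlier function):
from sklearn.model_selection import train_test_split

# stratify=Y_data keeps the class proportions identical in the train and test splits
X_train, X_test, Y_train, Y_test = train_test_split(
    X_data, Y_data, test_size=0.2, random_state=0, stratify=Y_data)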

Keras and Sklearn logreg returning different results

I'm comparing the results of a logistic regressor written in Keras to the default Sklearn Logreg. My input is one-dimensional. My output has two classes and I'm interested in the probability that the output belongs to the class 1.
I'm expecting the results to be almost identical, but they are not even close.
Here is how I generate my random data. Note that X_train, X_test are still vectors, I'm just using capital letters because I'm used to it. Also there is no need for scaling in this case.
X = np.linspace(0, 1, 10000)
y = np.random.sample(X.shape)
y = np.where(y<X, 1, 0)
Here's cumsum of y plotted over X. Doing a regression here is not rocket science.
I do a standard train-test-split:
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train = X_train.reshape(-1,1)
X_test = X_test.reshape(-1,1)
Next, I train a default logistic regressor:
from sklearn.linear_model import LogisticRegression
sk_lr = LogisticRegression()
sk_lr.fit(X_train, y_train)
sklearn_logreg_result = sk_lr.predict_proba(X_test)[:,1]
And a logistic regressor that I write in Keras:
from keras.models import Sequential
from keras.layers import Dense
keras_lr = Sequential()
keras_lr.add(Dense(1, activation='sigmoid', input_dim=1))
keras_lr.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])
_ = keras_lr.fit(X_train, y_train, verbose=0)
keras_lr_result = keras_lr.predict(X_test)[:,0]
And a hand-made solution:
pearson_corr = np.corrcoef(X_train.reshape(X_train.shape[0],), y_train)[0,1]
b = pearson_corr * np.std(y_train) / np.std(X_train)
a = np.mean(y_train) - b * np.mean(X_train)
handmade_result = (a + b * X_test)[:,0]
I expect all three to deliver similar results, but here is what happens. This is a reliability diagram using 100 bins.
I have played around with loss functions and other parameters, but the Keras logreg stays roughly like this. What might be causing the problem here?
edit: Using binary crossentropy is not the solution here, as shown by this plot (note that the input data has changed between the two plots).
While both implementations are a form of logistic regression, there are quite a few differences. Both solutions converge to a comparable minimum (0.75/0.76 ACC), but they are not identical:
Optimizer - Keras uses vanilla SGD, whereas sklearn's LogisticRegression is based on liblinear, which implements a trust-region Newton method.
Regularization - sklearn has built-in L2 regularization; the Keras model as written has none.
Weights - the weights are randomly initialized and probably sampled from a different distribution.
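As a hedged sketch (not part of the original answer), the Keras model can be brought closer to sklearn's behaviour by using the actual log loss, adding an L2 penalty, and training for more than the default number of epochs; the regularization strength and epoch count below are illustrative guesses, and, as the question's edit notes, changing the loss alone may not be enough:
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2

keras_lr = Sequential()
# L2 penalty roughly mimics sklearn's default regularization (the strength is a guess)
keras_lr.add(Dense(1, activation='sigmoid', input_dim=1, kernel_regularizer=l2(0.01)))
# binary cross-entropy is the actual logistic-regression loss, unlike MSE
keras_lr.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
keras_lr.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0)
keras_lr_result = keras_lr.predict(X_test)[:, 0]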

Computing Pipeline logistic regression predict_proba in sklearn

I have a dataframe with 3 features and 3 classes that I split into X_train, Y_train, X_test, and Y_test, and I then run sklearn's Pipeline with PCA, StandardScaler and finally Logistic Regression. I want to be able to calculate the probabilities directly from the LR weights and the raw data without using predict_proba, but I don't know how, because I'm not sure exactly how the pipeline pipes X_test through PCA and StandardScaler into the logistic regression. Is this realistic without being able to use PCA's and StandardScaler's fit methods? Any help would be greatly appreciated!
So far, I have:
pca = PCA(whiten=True)
scaler = StandardScaler()
logistic = LogisticRegression(fit_intercept=True, class_weight='balanced', solver='sag', n_jobs=-1, C=1.0, max_iter=200)
pipe = Pipeline(steps=[('pca', pca), ('scaler', scaler), ('logistic', logistic)])
pipe.fit(X_train, Y_train)
predict_probs = pipe.predict_proba(X_test)
coefficients = pipe.steps[2][1].coef_    # (3 by 30)
intercepts = pipe.steps[2][1].intercept_  # (1 by 3)
This is also a question I couldn't figure out; thanks for Kumar's answer.
I thought the pipeline would fit a new transform on x_test, but when I compared a Pipeline composed of StandardScaler and LogisticRegression against my own function using StandardScaler and LogisticRegression separately, I found that the Pipeline actually applies the transform fitted on x_train. So don't worry about using Pipeline, it's really a convenient and useful tool for machine learning!
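To illustrate the original question, here is a hedged sketch of reproducing predict_proba by hand from the fitted pipeline steps. It assumes a one-vs-rest model (the default in older sklearn versions), where per-class sigmoids are normalized to sum to one; a multinomial model (the newer default with non-liblinear solvers) would use a softmax over the scores instead:
import numpy as np

# apply the fitted transformers in pipeline order: PCA first, then scaling
X_pca = pipe.named_steps['pca'].transform(X_test)
X_scaled = pipe.named_steps['scaler'].transform(X_pca)

# linear decision scores, one column per class
lr = pipe.named_steps['logistic']
scores = X_scaled @ lr.coef_.T + lr.intercept_

# one-vs-rest: per-class sigmoid, then normalize rows to sum to 1
probs = 1.0 / (1.0 + np.exp(-scores))
probs /= probs.sum(axis=1, keepdims=True)

print(np.allclose(probs, pipe.predict_proba(X_test)))  # should print True if the model was fitted one-vs-rest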
