Errors in RFE and KFold method using Logistic Regression - python-3.x

I am getting errors in the RFE and K-Fold steps of my Python code using logistic regression. How can I make this code run without errors?
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

accuracies = []
feature_set = []
max_accuracy_so_far = 0
for i in range(1, len(X[0]) + 1):
    # n_features_to_select must be passed by keyword in recent scikit-learn
    selector = RFE(LogisticRegression(), n_features_to_select=i, verbose=1)
    selector = selector.fit(X, y)
    current_accuracy = selector.score(X, y)
    accuracies.append(current_accuracy)
    feature_set.append(selector.support_)
    if max_accuracy_so_far < current_accuracy:
        max_accuracy_so_far = current_accuracy
        selected_features = selector.support_
    print('End of iteration no. {}'.format(i))
X_sub = X[:, selected_features]

# KFold model score
scores = []
max_score = 0
from sklearn.model_selection import KFold
kf = KFold(n_splits=4, random_state=0, shuffle=True)
for train_index, test_index in kf.split(X_sub):
    X_train, X_test = X_sub[train_index], X_sub[test_index]
    y_train, y_test = y[train_index], y[test_index]
    current_model = LogisticRegression()
    # train the model
    current_model.fit(X_train, y_train)
    # see performance score (the original called model.score, but no name
    # `model` exists here -- it must be current_model)
    current_score = current_model.score(X_test, y_test)
    scores.append(current_score)
    if max_score < current_score:
        max_score = current_score
        best_model = current_model
best_model.intercept_
best_model.coef_
What do I need to change so that this code runs without errors?
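Two likely culprits, assuming a recent scikit-learn release: RFE stopped accepting n_features_to_select as a positional argument (fixed above by passing it as a keyword), and the K-Fold loop scored an undefined name model instead of current_model (also fixed above). As a side note, the manual K-Fold loop can be condensed with cross_val_score; a minimal sketch, assuming X_sub and y as above:

from sklearn.model_selection import cross_val_score

# mean 4-fold CV accuracy of logistic regression on the selected features
cv_scores = cross_val_score(LogisticRegression(), X_sub, y, cv=4)
print(cv_scores.mean())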

Related

log loss computed manually diverging from the cross_validation_score method from scikit-learn

I have a question about how cross_val_score() from scikit-learn works. I tried to divide the dataset into 10 folds with KFold() and compute the log loss of both the training and validation sets for each fold. However, I got different answers when using cross_val_score with the parameter scoring='neg_log_loss'.
X and y are arrays of shape (1800, 12) and (1800, 1), respectively.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=10)
train_loss = []
val_loss = []
for train_index, val_index in kfold.split(X, y):
    clf_logreg = LogisticRegression()
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    clf_logreg.fit(X_train, y_train)
    y_train_pred = clf_logreg.predict(X_train)
    y_val_pred = clf_logreg.predict(X_val)
    train_loss.append(log_loss(y_train, y_train_pred))
    val_loss.append(log_loss(y_val, y_val_pred))

clf_logreg.fit(X, y)
y_error = cross_val_score(clf_logreg, X, y, cv=kfold, scoring='neg_log_loss')
print("cross_val log_loss: ", -y_error)
print("\ntrain_loss: ", train_loss)
print("\nval_loss: ", val_loss)
The answers I got:
cross_val log_loss: [0.18546779 0.18002459 0.21591202 0.15872213 0.22852112 0.18766844
0.28641203 0.14923009 0.21446935 0.20373971]
train_loss: [2.79298449379999, 2.7290223160363962, 2.558457002245472, 2.835624958485065, 2.5797806896386337, 2.622420660745048, 2.5797797024813125, 2.6224201671663874, 2.5797782217453302, 2.6863818513513213]
val_loss: [1.9188431218680995, 2.1107385395747733, 3.645826363693089, 2.110734097366828, 3.2620355282797417, 2.686367043991502, 3.453913177154633, 2.4944849529086657, 2.8782624616981765, 2.4944938373245567]
As Ben Reiniger noted in the comments, log_loss expects probabilities in y_train_pred and y_val_pred. So you need to change clf_logreg.predict to clf_logreg.predict_proba.
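(That is also why the manual losses above are so large: passing hard 0/1 labels makes log_loss treat every prediction as a fully confident probability, so each misclassified sample contributes a large, clipped penalty.)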
Example:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
y = y == 1  # binary target

kfold = KFold(n_splits=10, random_state=1, shuffle=True)
train_loss = []
val_loss = []
for train_index, val_index in kfold.split(X, y):
    clf_logreg = LogisticRegression()
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    clf_logreg.fit(X_train, y_train)
    y_train_pred = clf_logreg.predict_proba(X_train)
    y_val_pred = clf_logreg.predict_proba(X_val)
    train_loss.append(log_loss(y_train, y_train_pred))
    val_loss.append(log_loss(y_val, y_val_pred))

clf_logreg.fit(X, y)
y_error = cross_val_score(clf_logreg, X, y, cv=kfold, scoring="neg_log_loss")
print("cross_val log_loss: ", -y_error)
print("\nval_loss: ", val_loss)
Results:
cross_val log_loss: [0.53548503 0.54200945 0.60324094 0.64781483 0.43323992 0.37625601
0.55101127 0.46172226 0.50216316 0.64359642]
val_loss: [0.5354850268015129, 0.5420094471965571, 0.6032409439788419, 0.647814828089315, 0.43323991804482626, 0.3762560144867495, 0.5510112741331039, 0.46172225526408, 0.5021631570133954, 0.6435964210060579]
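Note how the per-fold val_loss values now match the cross_val_score results exactly.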

Model fit after each "fold" in LOOCV? Multilabel LOOCV by hand

I am wondering whether it is an issue (possible data leakage?) when implementing leave-one-out cross-validation by hand if the model is fit on each iteration after testing on each fold. It seems like if the model is trained on all data except "X" and, after testing on "X", is then trained on all data other than "Y" and tested on "Y", it has already seen "Y" on the first iteration. Is this actually a problem, and does my implementation of LOOCV by hand appear to be correct?
Thanks for your time!
i = 0
j = 0
for i in range(0, 41):
    X_copy = X_orig[i:(i + 1)]  # slice the ith element from the numpy array
    y_copy = y_orig[i:(i + 1)]
    X_model = np.delete(X_orig, i, axis=0)  # all samples except the ith
    y_model = np.delete(y_orig, i, axis=0)
    model.fit(X_model, y_model, epochs=115, batch_size=28, verbose=0)  # verbose=0 removes learning info
    prediction = model.predict(X_copy)
    # threshold the predicted probabilities at 0.5
    prediction[prediction >= 0.5] = 1
    prediction[prediction < 0.5] = 0
    print(prediction, y_copy)
    if np.array_equal(y_copy, prediction):
        j = j + 1
print(j / 41)  # for 41 samples in the dataset
Why don't you use this?
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
model = ...
test_fold_predictions = []
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    test_fold_predictions.append(model.predict(X_test))
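As to the leakage worry: with scikit-learn estimators this pattern is safe, because calling fit() re-initializes the model from scratch on each call, so nothing carries over between folds.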
EDIT
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=X_train.shape[1]))
model.add(Dropout(0.1))
model.add(Dense(600, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(y_train.shape[1], activation='sigmoid'))
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
test_fold_predictions = []
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train, epochs=5, batch_size=2000)
    test_fold_predictions.append(model.predict(X_test))
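One caveat with the Keras version above: unlike scikit-learn estimators, a Keras model keeps its weights between fit() calls, so each fold continues training a model that has already seen the earlier held-out samples, which is exactly the leakage the question asks about. A minimal sketch of one way around it, assuming the same architecture, is to rebuild the model inside the loop:

def build_model(n_features, n_outputs):
    # a fresh, untrained model for every fold
    model = Sequential()
    model.add(Dense(5000, activation='relu', input_dim=n_features))
    model.add(Dropout(0.1))
    model.add(Dense(600, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(n_outputs, activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer=SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True))
    return model

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = build_model(X_train.shape[1], y_train.shape[1])  # re-initialize per fold
    model.fit(X_train, y_train, epochs=5, batch_size=2000)
    test_fold_predictions.append(model.predict(X_test))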

What's the purpose of np_utils.to_categorical() in keras?

Hello, I'm teaching myself sentiment recognition from audio files, using code from a git repository.
Code sample:
newdf1 = np.random.rand(len(rnewdf)) < 0.8
train = rnewdf[newdf1]
test = rnewdf[~newdf1]
trainfeatures = train.iloc[:, :-1]
trainlabel = train.iloc[:, -1:]
testfeatures = test.iloc[:, :-1]
testlabel = test.iloc[:, -1:]
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder
X_train = np.array(trainfeatures)
y_train = np.array(trainlabel)
X_test = np.array(testfeatures)
y_test = np.array(testlabel)
lb = LabelEncoder()
y_train = np_utils.to_categorical(lb.fit_transform(y_train))
y_test = np_utils.to_categorical(lb.fit_transform(y_test))
I'd like to understand what this code does:
y_train = np_utils.to_categorical(lb.fit_transform(y_train))
y_test = np_utils.to_categorical(lb.fit_transform(y_test))
I ask because in the training phase of the CNN I get an error in model.fit:
Error when checking target: expected activation_26 to have shape (1,)...
Understanding this may help me overcome the problem.
Thanks
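In short (a sketch of what these lines do, not taken from the original thread): LabelEncoder.fit_transform maps each distinct label string to an integer 0..n_classes-1, and np_utils.to_categorical turns those integers into one-hot vectors, the target format a softmax/categorical-cross-entropy output layer expects. For example:

import numpy as np
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder

labels = np.array(['happy', 'sad', 'happy', 'angry'])
encoded = LabelEncoder().fit_transform(labels)  # [1, 2, 1, 0] (classes sorted alphabetically)
one_hot = np_utils.to_categorical(encoded)      # shape (4, 3): one column per class
print(one_hot)
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]

The "expected ... to have shape (1,)" error usually means the output layer or loss expects integer labels while the targets are one-hot vectors (or the reverse), so the last layer's size and the loss must match the target encoding. Also note that calling lb.fit_transform separately on y_test can assign a different label-to-integer mapping if the test split is missing some classes; lb.transform(y_test) reuses the mapping learned on y_train.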

Perform cross validation without cross_val_score

In order to have full access to the inner and outer scores, I would like to create a nested cross-validation and grid search without using cross_val_score.
I have followed examples I found online, like this one: https://github.com/rasbt/pattern_classification/blob/master/data_viz/model-evaluation-articles/nested_cv_code.ipynb.
I have doubts about whether the inner loop is OK. I am not sure whether I have to split the data before calling GridSearchCV:
for train_index_inner, test_index_inner in inner_cv.split(X_train_outer, y_train_outer):
    X_train_inner = X_train_outer[train_index_inner]
    y_train_inner = y_train_outer[train_index_inner]
    X_test_inner = X_train_outer[test_index_inner]
    y_test_inner = y_train_outer[test_index_inner]
    # inner cross-validation
    for name, gs_est in sorted(gridcvs.items()):
        gs_est.fit(X_train_inner, y_train_inner)
        y_pred = gs_est.predict(X_test_inner)
        inner_score = r2_score(y_true=y_test_inner, y_pred=y_pred)
        cv_scores[name].append(inner_score)
print('print cvscores for model:', cv_scores)
outer_counter = outer_counter + 1
The whole code:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import operator

perf_list = []      # list with the performance
hp_list = []        # hyperparameter list
algo_familiy = []   # algorithm family list

randomState = 1
average_scores_across_outer_folds_for_each_model = dict()
X, y = make_regression(n_samples=1000, n_features=10)

# Create X_gtest, y_gtest = TEST SET
# Create X_train, y_train = TRAIN & VALIDATION SET
X_train, X_gtest, y_train, y_gtest = train_test_split(X, y, train_size=0.8, random_state=randomState)
print(X_train.shape)

# Regressors you want to use
reg1 = KNeighborsRegressor()
reg2 = RandomForestRegressor()

# Building the pipelines (Transformer, Regressor)
pipe1 = Pipeline([('std', StandardScaler()), ('reg1', reg1)])
pipe2 = Pipeline([('std', StandardScaler()), ('reg2', reg2)])

# Setting up parameters for the grid
param_grid1 = [{'reg1__n_neighbors': list(range(7, 10))}]
param_grid2 = [{'reg2__max_depth': [50, 20]}]

# outer cross-validation
outer_counter = 1
outer_cv = KFold(n_splits=3, shuffle=True)
inner_cv = KFold(n_splits=2, shuffle=True, random_state=randomState)

gridcvs = {}
for pgrid, est, name in zip((param_grid1, param_grid2),
                            (pipe1, pipe2),
                            ('KNN', 'RF')):
    regressor_that_optimizes_its_hyperparams = GridSearchCV(estimator=est,
                                                            param_grid=pgrid,
                                                            scoring='r2',
                                                            n_jobs=1,
                                                            cv=inner_cv,
                                                            verbose=0,
                                                            refit=True)
    gridcvs[name] = regressor_that_optimizes_its_hyperparams

for train_index_outer, test_index_outer in outer_cv.split(X_train, y_train):
    print('outer_cv', outer_counter)
    X_train_outer = X_train[train_index_outer]
    y_train_outer = y_train[train_index_outer]
    X_test_outer = X_train[test_index_outer]
    y_test_outer = y_train[test_index_outer]
    # note: cv_scores is re-created on every outer fold, so the summary
    # below only reflects the last outer fold
    cv_scores = {name: [] for name, gs_est in gridcvs.items()}
    for train_index_inner, test_index_inner in inner_cv.split(X_train_outer, y_train_outer):
        X_train_inner = X_train_outer[train_index_inner]
        y_train_inner = y_train_outer[train_index_inner]
        X_test_inner = X_train_outer[test_index_inner]
        y_test_inner = y_train_outer[test_index_inner]
        # inner cross-validation
        for name, gs_est in sorted(gridcvs.items()):
            gs_est.fit(X_train_inner, y_train_inner)
            y_pred = gs_est.predict(X_test_inner)
            inner_score = r2_score(y_true=y_test_inner, y_pred=y_pred)
            cv_scores[name].append(inner_score)
    print('print cvscores for model:', cv_scores)
    outer_counter = outer_counter + 1

# Looking at the results
for name in cv_scores:
    print('%-8s | outer CV acc. %.2f%% +\- %.3f' % (
        name, 100 * np.mean(cv_scores[name]), 100 * np.std(cv_scores[name])))
many_stars = '\n' + '*' * 100 + '\n'
print(many_stars + 'Now we choose the best model and refit on the whole dataset' + many_stars)

# Fitting a model to the whole training set using the "best" algorithm
best_algo = gridcvs['RF']
best_algo.fit(X_train, y_train)
train_acc = r2_score(y_true=y_train, y_pred=best_algo.predict(X_train))
test_acc = r2_score(y_true=y_gtest, y_pred=best_algo.predict(X_gtest))
print('Accuracy %.2f%% (average over CV test folds)' % (100 * best_algo.best_score_))
print('Best Parameters: %s' % gridcvs['RF'].best_params_)
print('Training Accuracy: %.2f%%' % (100 * train_acc))
print('Test Accuracy: %.2f%%' % (100 * test_acc))

# Fitting a model to the whole dataset
# using the "best" algorithm and hyperparameter settings
best_clf = best_algo.best_estimator_
final_model = best_clf.fit(X, y)
In general, you can obtain nested cross-validation using the code you posted:
for train_index_outer, test_index_outer in outer_cv.split(X_train, y_train):
    print('outer_cv', outer_counter)
    X_train_outer = X_train[train_index_outer]
    y_train_outer = y_train[train_index_outer]
    X_test_outer = X_train[test_index_outer]
    y_test_outer = y_train[test_index_outer]
    for train_index_inner, test_index_inner in inner_cv.split(X_train_outer, y_train_outer):
        X_train_inner = X_train_outer[train_index_inner]
        y_train_inner = y_train_outer[train_index_inner]
        X_test_inner = X_train_outer[test_index_inner]
        y_test_inner = y_train_outer[test_index_inner]
        # fit something on X_train_inner
        # evaluate it on X_test_inner
Or you could do the following: if you pass GridSearchCV the argument cv=inner_cv, it will automatically perform the splits when you call the .fit() method. When the fit is complete, you can explore .cv_results_ to get the individual model scores on each of the automatically generated inner folds.
for train_index_outer, test_index_outer in outer_cv.split(X_train, y_train):
    X_train_outer = X_train[train_index_outer]
    y_train_outer = y_train[train_index_outer]
    X_test_outer = X_train[test_index_outer]
    y_test_outer = y_train[test_index_outer]
    cv = GridSearchCV(estimator=est,
                      param_grid=pgrid,
                      scoring='r2',
                      n_jobs=1,
                      cv=inner_cv,
                      verbose=0,
                      refit=True)
    cv.fit(X_train_outer, y_train_outer)
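As a sketch of what exploring .cv_results_ can look like (continuing inside the outer loop above, with the names from that snippet):

    # inner-fold scores for every hyperparameter combination tried
    for params, mean_score in zip(cv.cv_results_['params'],
                                  cv.cv_results_['mean_test_score']):
        print(params, mean_score)
    # outer-fold score of the refit best inner model
    print('outer r2:', cv.score(X_test_outer, y_test_outer))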

Python different results for manual and cross_val_score prediction

I have one question: I'm trying to implement KFold and cross_val_score.
My goal is to calculate mean_squared_error, and for this purpose I used the following code:
from sklearn import linear_model
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
x = np.random.random((10000,20))
y = np.random.random((10000,1))
x_train = x[7000:]
y_train = y[7000:]
x_test = x[:7000]
y_test = y[:7000]
Model = linear_model.LinearRegression()
Model.fit(x_train,y_train)
y_predicted = Model.predict(x_test)
MSE = mean_squared_error(y_test,y_predicted)
print(MSE)
kfold = KFold(n_splits = 100, random_state = None, shuffle = False)
results = cross_val_score(Model,x,y,cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())
I think it's all right here; I got the following results:
Results: 0.0828856459279 and -0.083069435946
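(These agree, and the value itself is expected: with purely random features the model can do no better than predicting the mean of y, so the MSE approaches the variance of a uniform [0, 1) variable, 1/12 ≈ 0.083.)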
But when I try to do this on another example (data from the Kaggle House Prices competition), it does not work properly, or at least I think so:
train = pd.read_csv('train.csv')
# ... insert missing values ...
train = pd.get_dummies(train)
y = train['SalePrice']
train = train.drop(['SalePrice'], axis = 1)
x_train = train[:1000].values.reshape(-1,339)
y_train = y[:1000].values.reshape(-1,1)
y_train_normal = np.log(y_train)
x_test = train[1000:].values.reshape(-1,339)
y_test = y[1000:].values.reshape(-1,1)
Model = linear_model.LinearRegression()
Model.fit(x_train,y_train_normal)
y_predicted = Model.predict(x_test)
y_predicted_transform = np.exp(y_predicted)
MSE = mean_squared_error(y_test, y_predicted_transform)
print(MSE)
kfold = KFold(n_splits = 10, random_state = None, shuffle = False)
results = cross_val_score(Model,train,y, cv = kfold, scoring = "neg_mean_squared_error")
print(results.mean())
Here I get the following results: 0.912874946869 and -6.16986926564e+16
Apparently, the mean_squared_error calculated 'manually' is not the same as the mean_squared_error calculated with the help of KFold. Where did I make a mistake?
The discrepancy is because, in contrast to your first approach (training/test set), in your CV approach you use the unnormalized y data for fitting the regression, hence your huge MSE. To get comparable results, you should do the following:
y_normal = np.log(y)
y_test_normal = np.log(y_test)
MSE = mean_squared_error(y_test_normal, y_predicted) # NOT y_predicted_transform
results = cross_val_score(Model, train, y_normal, cv = kfold, scoring = "neg_mean_squared_error")
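With those changes, both the manual estimate and cross_val_score measure squared error on the same (log) scale, so the two numbers become directly comparable.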
