K-fold cross validation in Python

What I'm trying to do:
Get the K-fold cross-validated scores of an SVM. The data has all numerical independent variables and a categorical dependent variable. I'm using Python 3, sklearn and feature-engine.
My understanding of the matter:
The independent variables have NA values, each below 5% of the total data points, so I imputed them using the median values from the train set, as the variables are not normally distributed. I also scaled the values of the train and test set using the values from the train set. My train-test split is 80-20.
I understand that it is good practice to scale and impute data using only the train set, as this helps avoid overfitting and data leakage.
When it comes to K-fold cross validation, the train and test sets change with each fold.
Question:
Is there a way to ensure that I can re-impute and re-scale the train and test set based on the train set of each fold?
Any help is appreciated, thank you!
Train-test split using a random seed. Same random seed is used in the K-Fold cross validation.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 3)
NA value imputation;
from feature_engine import missing_data_imputers as mdi
imputer = mdi.MeanMedianImputer(imputation_method = 'median')
imputer.fit(X_train)
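X_train = imputer.transform(X_train)
# the test set is imputed with the medians learned from the train set
X_test = imputer.transform(X_test)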
Variable transformation;
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)
Below is the SVM;
from sklearn import svm, metrics
def svm1(gam, C):
    clf1 = svm.SVC(gamma=gam, C=C)
    clf1.fit(X_train_trans, y_train)
    print('The Trainset Score is {}.'.format(clf1.score(X_train_trans , y_train)))
    print('The Testset Score is {}.'.format(clf1.score(X_test_trans , y_test)))
    print('')
    y_pred1 = clf1.predict(X_test_trans)
    print('The confusion matrix is; \n{}'.format(metrics.confusion_matrix(y_test , y_pred1)))
interactive(svm1, gam = G1, C = cc1)
I then merge the train and test set, to get back a transformed dataset;
frames3 = [X_test_trans, X_train_trans ]
X_Final = pd.concat(frames3)
Now I fit X_Final, which is the concatenated train and test set, to get the K-fold cross-validated score.
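# random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits = 10, shuffle = True, random_state = 3)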
model = svm.SVC(gamma=0.23, C=3.20)
results = cross_val_score(model, PCA_X_Final,y_Final, cv = kfold)
print(results)
print('Accuracy = {}%, Standard Deviation = {}%'.format(round(results.mean(), 4), round(results.std(), 2)))
I would like to know how I can re-scale and re-impute within each fold, so that the variables are re-scaled and the NA values re-imputed using only that fold's train set, to avoid overfitting and data leakage.

To impute and scale the data with the parameters derived from each fold in the CV, you first need to establish the engineering steps in a pipeline, and then do CV over the entire pipeline. For example something like this:
Set up the engineering pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from feature_engine import missing_data_imputers as mdi
my_pipe = Pipeline([
    # missing data imputation
    ('imputer_num',
     mdi.MeanMedianImputer(imputation_method='mean', variables=['varA', 'varB'])),
    # scaler
    ('scaler', StandardScaler()),
    # Gradient Boosted machine (or your SVM instead)
    ('gbm', GradientBoostingClassifier(random_state=0))
])
Then the CV:
param_grid = {
    # try different gradient boosted tree model parameters
    'gbm__max_depth': [None, 1, 3],
}
# now we set up the grid search with cross-validation
grid_search = GridSearchCV(my_pipe, param_grid,
                           cv=5, n_jobs=-1, scoring='roc_auc')
More details in this notebook.
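For the SVM from the question, the same idea can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the asker's actual setup: the data is synthetic, scikit-learn's SimpleImputer stands in for the feature-engine imputer, and the names svm_pipe and scores are made up for illustration. Because the whole pipeline is passed to cross_val_score, the median imputation and the scaling are re-fit on the training part of every fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption: numeric features, categorical target).
X, y = make_classification(n_samples=300, n_features=6, random_state=3)

# Blank out roughly 3% of the values to mimic the NA pattern described in the question.
rng = np.random.default_rng(3)
X[rng.random(X.shape) < 0.03] = np.nan

# Imputer and scaler live inside the pipeline, so they only ever see the training folds.
svm_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("svm", SVC(gamma=0.23, C=3.20)),
])

kfold = KFold(n_splits=10, shuffle=True, random_state=3)
scores = cross_val_score(svm_pipe, X, y, cv=kfold)
print("Accuracy = {:.4f}, Standard Deviation = {:.2f}".format(scores.mean(), scores.std()))
```

With this layout there is no need to impute, scale and re-merge the train and test sets by hand before cross validation; the raw feature matrix goes straight into cross_val_score.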

Related

How to use an explicit validation set with a predefined split fold?

I have explicit train, test and validation sets as 2d arrays:
X_train.shape
(1400, 38785)
X_val.shape
(200, 38785)
X_test.shape
(400, 38785)
I am tuning the alpha parameter and need advice about how I can use the predefined validation set in it:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, PredefinedSplit
nb = MultinomialNB()
nb.fit(X_train, y_train)
params = {'alpha': [0.1, 1, 3, 5, 10,12,14]}
# how to use on my validation set?
# ps = PredefinedSplit(test_fold=?)
gs = GridSearchCV(nb, param_grid=params, cv = ps, return_train_score=True, scoring='f1')
gs.fit(X_train, y_train)
My results so far are as follows.
# on my validation set, alpha = 5
gs.fit(X_val, y_val)
print('Grid best parameter', gs.best_params_)
Grid best parameter: {'alpha': 5}
# on my training set, alpha = 10
Grid best parameter: {'alpha': 10}
I have read the following questions and documentation yet I am not sure how to use PredefinedSplit() in my case. Thank you.
Order between using validation, training and test sets
https://scikit-learn.org/stable/modules/cross_validation.html#predefined-fold-splits-validation-sets
You can achieve your desired outcome by merging X_train and X_val, and passing PredefinedSplit a list of labels, with -1 indicating training data and 1 indicating validation data, i.e.:
import numpy as np
X = np.concatenate((X_train, X_val))
y = np.concatenate((y_train, y_val))
ps = PredefinedSplit(np.concatenate((np.zeros(len(X_train)) - 1, np.ones(len(X_val)))))
gs = GridSearchCV(nb, param_grid=params, cv = ps, return_train_score=True, scoring='f1')
gs.fit(X, y) # not X_train, y_train
However, unless there is a very good reason for holding out a separate validation set, you will likely have less overfitting if you use k-fold cross validation for your hyperparameter tuning rather than a dedicated validation set.
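To check what such a split actually does, here is a small self-contained sketch. The arrays are random placeholders (an assumption, much smaller than the real data), and test_fold, train_idx and val_idx are illustrative names. It uses 0 rather than 1 as the validation fold label; any non-negative value works, since only -1 has the special meaning "never use as validation".

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Placeholder count-like features and binary labels (hypothetical data).
X_train = rng.integers(0, 5, size=(140, 20))
y_train = rng.integers(0, 2, size=140)
X_val = rng.integers(0, 5, size=(20, 20))
y_val = rng.integers(0, 2, size=20)

X = np.concatenate((X_train, X_val))
y = np.concatenate((y_train, y_val))

# -1 keeps a row in the training part; 0 assigns it to validation fold 0.
test_fold = np.concatenate((np.full(len(X_train), -1), np.zeros(len(X_val), dtype=int)))
ps = PredefinedSplit(test_fold)

# The splitter yields exactly one (train, validation) pair.
train_idx, val_idx = next(ps.split())

gs = GridSearchCV(MultinomialNB(), param_grid={"alpha": [0.1, 1, 3, 5, 10]},
                  cv=ps, scoring="f1")
gs.fit(X, y)
print("Grid best parameter", gs.best_params_)
```

Note that with the default refit=True, GridSearchCV refits the best model on all of X, meaning the training and validation rows together, after the search finishes.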

How to test unseen test data with cross validation and predict labels?

1. The CSV that contains the data (i.e. text descriptions) along with categorized labels
df = pd.read_csv('./output/csv_sanitized_16_.csv', dtype=str)
X = df['description_plus']
y = df['category_id']
2. This CSV contains unseen data (i.e. text descriptions) for which labels need to be predicted
df_2 = pd.read_csv('./output/csv_sanitized_2.csv', dtype=str)
X2 = df_2['description_plus']
Cross validation function that operates on the training data (item #1) above.
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
def cross_val():
    cv = KFold(n_splits=20)
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    X_train = vectorizer.fit_transform(X)
    clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
    scores = cross_val_score(clf, X_train, y, cv=cv)
    print(scores)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
cross_val()
I need to know how to pass the unseen data (item #2) to the cross validation function, and how to predict the labels.
Using scores = cross_val_score(clf, X_train, y, cv=cv) you can only get the cross-validated scores of the model. cross_val_score will internally split the data into training and testing based on the cv parameter.
So the values that you get are the cross-validated accuracy of the SVC.
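To get the score on the unseen data, you can first fit the model, e.g.
clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
clf.fit(X_train, y) # the model is trained now
and then call clf.score(X_unseen, y_unseen), where X_unseen has been transformed with the same fitted vectorizer and y_unseen holds the true labels of the unseen data.
This returns the accuracy of the model on the unseen data. If the unseen data has no labels, as in item #2 of the question, clf.predict(X_unseen) returns the predicted labels instead.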
EDIT: The best way to do what you want is the following using a GridSearch to first find the best model using the training data and then evaluate the best model using the unseen (test) data:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# load some data
iris = datasets.load_iris()
X, y = iris.data, iris.target
#split data to training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# hyperparameter tuning of the SVC model
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
# fit the GridSearch using the TRAINING data
grid_searcher = GridSearchCV(svc, parameters)
grid_searcher.fit(X_train, y_train)
#recover the best estimator (best parameters for the SVC, based on the GridSearch)
best_SVC_model = grid_searcher.best_estimator_
# Now, check how this best model behaves on the test set.
# score() evaluates the already fitted model; cross_val_score would refit it on folds of the test set.
test_accuracy = best_SVC_model.score(X_test, y_test)
print(test_accuracy)
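Applied to the text data from the question, the workflow could look roughly like the sketch below. The tiny corpus, the category ids and the names text_clf, unseen_texts and predicted are all invented for illustration, and a linear-kernel SVC is an assumption rather than the asker's configuration. Putting the TfidfVectorizer inside the pipeline means the vocabulary is learned from the training folds only, and the same fitted vectorizer is reused for the unseen descriptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical labelled descriptions (stand-in for item #1).
texts = [
    "wireless mouse with usb receiver", "mechanical keyboard with backlight",
    "usb keyboard and mouse bundle", "gaming mouse with rgb lighting",
    "compact wireless keyboard", "ergonomic mouse pad",
    "stainless steel frying pan", "nonstick frying pan with lid",
    "ceramic kitchen knife set", "steel cooking pot with lid",
    "kitchen knife sharpener", "cast iron cooking pan",
]
labels = ["10"] * 6 + ["20"] * 6

# Hypothetical unlabelled descriptions (stand-in for item #2).
unseen_texts = ["bluetooth keyboard and mouse", "large cooking pot"]

# Vectorizer and classifier in one pipeline, so raw text goes in directly.
text_clf = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, stop_words="english"),
    SVC(kernel="linear", C=1),
)

# Step 1: estimate generalisation with cross validation on the labelled data.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(text_clf, texts, labels, cv=cv)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Step 2: fit on all labelled data, then predict labels for the unseen data.
text_clf.fit(texts, labels)
predicted = text_clf.predict(unseen_texts)
print(list(zip(unseen_texts, predicted)))
```

Cross validation and prediction are separate steps here: cross_val_score only reports scores and discards the models it fits, so the explicit fit call is what produces a model usable on the unseen data.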

Prediction with linear regression is very inaccurate

This is the CSV that I'm using: https://gist.github.com/netj/8836201. I'm trying to predict the variety, which is categorical data, with linear regression, but the prediction is very inaccurate. The actual labels are just combinations of 0 and 1, but the predictions are fractional values such as 0.x and 1.x, and even negative numbers. Where did I make the mistake, and what is the solution for this inaccuracy? This is an assignment my teacher gave me; he said we could predict categorical data with linear regression, not only logistic regression.
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
from sklearn import metrics
path= r"D:\python projects\iris.csv"
df = pd.read_csv(path)
array = df.values
X = array[:,0:3]
y = array[:,4]
le = preprocessing.LabelEncoder()
ohe = preprocessing.OneHotEncoder(categorical_features=[0])
y = le.fit_transform(y)
y = y.reshape(-1,1)
y = ohe.fit_transform(y).toarray()
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=0)
sc = preprocessing.StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
y_train = sc.fit_transform(y_train)
model = LinearRegression(n_jobs=-1).fit(X_train, y_train)
y_pred = model.predict(X_test)
df = pd.DataFrame({'Actual': X_test.flatten(), 'Predicted': y_pred.flatten()})
the output :
y_pred
Out[46]:
array([[-0.08676055, 0.43120144, 0.65555911],
[ 0.11735424, 0.72384335, 0.1588024 ],
[ 1.17081347, -0.24484483, 0.07403136],
X_test
Out[61]:
array([[-0.09544771, -0.58900572, 0.72247648],
[ 0.14071157, -1.98401928, 0.10361279],
[-0.44968663, 2.66602591, -1.35915595],
Linear Regression is used to predict continuous output data. As you correctly said, you are trying to predict categorical (discrete) output data. Essentially, you want to be doing classification instead of regression - linear regression is not appropriate for this.
As you also said, logistic regression can and should be used instead as it is applicable to classification tasks.
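A minimal sketch of the logistic regression route is shown below. It assumes scikit-learn's bundled copy of the iris data instead of the CSV from the question, and the names clf and accuracy are illustrative. The classifier takes the integer class labels directly, so the target needs neither one-hot encoding nor scaling, and every prediction is one of the original class labels.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled iris data: four numeric features, three classes encoded as 0, 1, 2.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Only the features are scaled; the target stays as class labels.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print("Predicted classes:", y_pred[:5])
print("Test accuracy:", accuracy)
```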

Upsampling using SMOTE in python

I am trying to use SMOTE in Python to handle a highly imbalanced data set. After splitting the data set into train and test sets, I generate synthetic samples using SMOTE. Then I use the xgboost algorithm on the SMOTE-generated data. My model output should be the predicted probabilities for the original data set, but after applying SMOTE the number of samples has increased. How do I get back to the original data set to predict the probabilities? Code as below:
X_train, X_test, y_train, y_test = train_test_split(X_final, Y_final, test_size=0.1, random_state = 27)
sm = SMOTE(random_state=27, ratio=1.0)
X_final_sm, Y_final_sm = sm.fit_sample(X_train, y_train)
smote_xgb = XGBClassifier().fit(X_final_sm, Y_final_sm)
smote_pred = smote_xgb.predict(X_final_sm)
smote_pred_prob = smote_xgb.predict_proba(X_final_sm)

Scaling of stock data

I am trying to apply machine learning to stock prediction, and I have run into a problem with scaling future, unseen (much higher) stock close values.
Let's say I use random forest regression to predict the stock price. I break the data into a train set and a test set.
For the train set, I use StandardScaler and do fit and transform.
Then I use the regressor to fit.
For the test set, I use StandardScaler and do transform.
Then I use the regressor to predict, and compare to the test labels.
If I plot the predictions and test labels on a graph, the predictions seem to max out at a ceiling. The problem is that StandardScaler is fit on the train set, while the test set (later in the timeline) has much higher values, and the algorithm does not know what to do with these extreme data.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
def test(X, y):
    # split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle=False)
    # preprocess the data
    pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])
    # model = LinearRegression()
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    # preprocessing fit transform on train data
    X_train = pipeline.fit_transform(X_train)
    # fit model on train data with train label
    model.fit(X_train, y_train)
    # transform on test data
    X_test = pipeline.transform(X_test)
    # predict on test data
    y_pred = model.predict(X_test)
    # print(np.sqrt(mean_squared_error(y_test, y_pred)))
    d = {'actual': y_test, 'predict': y_pred}
    plot_data = pd.DataFrame.from_dict(d)
    sns.lineplot(data=plot_data)
    plt.show()
What should be done with the scaling?
This is what I got for plotting prediction, actual close price vs time
The problem mainly comes from the model you are using, not from the scaling. A RandomForest regressor is built on decision trees, which learn to map inputs to the outputs seen in the training set. Each prediction is an average of training targets, so the model cannot predict a value outside the range it saw during training. Consequently, the RandomForest regressor will work for middle values, but for extreme values that it hasn't seen during training it will plateau, as your picture shows.
What you want is to learn a function directly, using linear/polynomial regression or more advanced algorithms like ARIMA.
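The ceiling effect is easy to reproduce on a toy series. The sketch below uses a made-up upward-trending "price" (an assumption, not real stock data) with time as the only feature, and compares a random forest with a plain linear regression on a chronological split. The names forest_pred and linear_pred are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical series: linear upward trend plus a little noise.
t = np.arange(200, dtype=float).reshape(-1, 1)
rng = np.random.default_rng(0)
price = 50.0 + 0.5 * t.ravel() + rng.normal(scale=1.0, size=200)

# Chronological split: the test period lies entirely after the training period.
t_train, t_test = t[:160], t[160:]
p_train, p_test = price[:160], price[160:]

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(t_train, p_train)
linear = LinearRegression().fit(t_train, p_train)

forest_pred = forest.predict(t_test)
linear_pred = linear.predict(t_test)

# The forest can never exceed the largest target it was trained on.
print("max training price:   ", p_train.max())
print("max forest prediction:", forest_pred.max())
print("max linear prediction:", linear_pred.max())
```

The forest's predictions stay at or below the highest training price no matter how the inputs are scaled, while the linear model follows the trend into the test period. A straight line fits this toy series by construction; real prices are not this well behaved, so this only illustrates the extrapolation issue.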
