how to use an explicit validation set with predefined split fold?

I have explicit train, test and validation sets as 2d arrays:
X_train.shape
(1400, 38785)
X_val.shape
(200, 38785)
X_test.shape
(400, 38785)
I am tuning the alpha parameter and need advice on how to use my predefined validation set for this:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, PredefinedSplit
nb = MultinomialNB()
nb.fit(X_train, y_train)
params = {'alpha': [0.1, 1, 3, 5, 10,12,14]}
# how to use on my validation set?
# ps = PredefinedSplit(test_fold=?)
gs = GridSearchCV(nb, param_grid=params, cv = ps, return_train_score=True, scoring='f1')
gs.fit(X_train, y_train)
My results so far are as follows.
# on my validation set, alpha = 5
gs.fit(X_val, y_val)
print('Grid best parameter', gs.best_params_)
Grid best parameter: {'alpha': 5}
# on my training set, alpha = 10
Grid best parameter: {'alpha': 10}
I have read the following questions and documentation, yet I am not sure how to use PredefinedSplit() in my case. Thank you.
Order between using validation, training and test sets
https://scikit-learn.org/stable/modules/cross_validation.html#predefined-fold-splits-validation-sets

You can achieve your desired outcome by merging X_train and X_val and passing PredefinedSplit an array of fold labels, with -1 marking samples that always stay in the training set and 1 marking the validation samples, i.e.:
import numpy as np
X = np.concatenate((X_train, X_val))
y = np.concatenate((y_train, y_val))
# -1 -> always in the training split, 1 -> held out as the single validation fold
test_fold = np.concatenate((-np.ones(len(X_train)), np.ones(len(X_val))))
ps = PredefinedSplit(test_fold)
gs = GridSearchCV(nb, param_grid=params, cv=ps, return_train_score=True, scoring='f1')
gs.fit(X, y)  # not X_train, y_train
However, unless there is a very good reason for holding out a separate validation set, you will likely overfit less if you use k-fold cross-validation for your hyperparameter tuning rather than a dedicated validation set.
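For example, a minimal sketch of the k-fold alternative, reusing the nb and params defined above and letting GridSearchCV handle the splitting itself:
X = np.concatenate((X_train, X_val))
y = np.concatenate((y_train, y_val))
# 5-fold cross-validation over the combined training + validation data
gs_cv = GridSearchCV(nb, param_grid=params, cv=5, return_train_score=True, scoring='f1')
gs_cv.fit(X, y)
print('Grid best parameter', gs_cv.best_params_)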

Related

how to transform test data to pick same features as training dataset

Simply put, I am trying to apply the same feature selection to the test data as I did to the train set; however, the test set does not have the exact same shape.
def get_important_features(X_train, Y_train, X_test):
    '''
    :param X_train: features of training set of type scipy.sparse.csr_matrix
    :param Y_train: labels of training set of type scipy.sparse.csr_matrix
    :param X_test: features of test set of type scipy.sparse.csr_matrix
    :return:
    '''
    select_percentile = SelectPercentile(chi2, percentile=75)
    print(X_train.shape)
    print(X_test.shape)
    X_new_train = select_percentile.fit_transform(X_train, Y_train)
    # print(select_percentile.get_support(indices=True))
    X_new_test = select_percentile.transform(X_test)
    return X_new_train, X_new_test
So the training set shape is (836, 3188) and the test set shape is (633, 3187). As you can see, the test set does not have the same shape as the training set; however, all I care about is picking only the features that exist in the training set after applying chi2. Also, as you might know, X_new_test = select_percentile.transform(X_test) throws ValueError: X has a different shape than during fitting. because of the reason I mentioned above. Is there any way I can extract these features from X_test without using transform(X_test)?
Note: the input is a CSR matrix, not a DataFrame, since I get these values from libsvm-format documents.
train= load_svmlight_file(train_file_name)
X_train = train[0]
Y_train = train[1]
test= load_svmlight_file(test_file_name)
X_test = test[0]
Y_test = test[1]
I tried your function and it works. Make sure you are passing the data in the correct way. Below is a minimal example for your reference:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2
# dummy data
train = pd.DataFrame(np.random.randint(1000, size=(50, 10)), columns=['A'+str(x) for x in range(10)])
test = pd.DataFrame(np.random.randint(1000, size=(30, 9)), columns=['A'+str(x) for x in range(9)])
# assuming the last column is the target variable
X_new_train, X_new_test = get_important_features(train.iloc[:,:-1], train.iloc[:,-1], test)
print(X_new_train.shape, X_new_test.shape)
(50, 6) (30, 6)
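If the mismatch comes from the libsvm files themselves (the last feature simply never appears in the test file), one way to get matching column counts, assuming both files index features the same way, is to load the files together or to pin the test matrix to the training width:
from sklearn.datasets import load_svmlight_file, load_svmlight_files
# option 1: load both files in one call so they share the same feature space
X_train, Y_train, X_test, Y_test = load_svmlight_files([train_file_name, test_file_name])
# option 2: load separately, but force the test matrix to the training width
train = load_svmlight_file(train_file_name)
X_train, Y_train = train[0], train[1]
test = load_svmlight_file(test_file_name, n_features=X_train.shape[1])
X_test, Y_test = test[0], test[1]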

Kfold cross validation in python

What I'm trying to do:
Get the K-fold cross-validated scores of an SVM. The data has all numerical independent variables and a categorical dependent variable. I'm using Python 3, sklearn and feature-engine.
My understanding of the matter:
The independent variables have NA values, all of them below 5% of the total data points, so I imputed them using the median values from the train set, as the variables are not normally distributed. I also scaled the values of the train and test set using the values from the train set. My train-test split is 80-20.
I understand that it is good practice to scale and impute data using only the train set, as this helps avoid overfitting and data leakage.
When it comes to K-fold cross-validation, the train and test sets change in every fold.
Question:
Is there a way to ensure that I can re-impute and re-scale the train and test set based on the train set of each fold?
Any help is appreciated, thank you!
Train-test split using a random seed; the same random seed is used in the K-fold cross-validation.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 3)
NA value imputation:
from feature_engine import missing_data_imputers as mdi
imputer = mdi.MeanMedianImputer(imputation_method = 'median')
imputer.fit(X_train)
X_train = imputer.transform(X_train)
Variable transformation:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)
Below is the SVM:
def svm1(gam, C):
    clf1 = svm.SVC(gamma=gam, C=C)
    clf1.fit(X_train_trans, y_train)
    print('The Trainset Score is {}.'.format(clf1.score(X_train_trans, y_train)))
    print('The Testset Score is {}.'.format(clf1.score(X_test_trans, y_test)))
    print('')
    y_pred1 = clf1.predict(X_test_trans)
    print('The confusion matrix is: \n{}'.format(metrics.confusion_matrix(y_test, y_pred1)))
interactive(svm1, gam = G1, C = cc1)
I then merge the train and test sets to get back a transformed dataset:
frames3 = [X_test_trans, X_train_trans ]
X_Final = pd.concat(frames3)
Now I fit X_Final, which is the concatenated train and test set, to get the K-fold cross-validated score.
kfold = KFold(n_splits = 10, random_state = 3)
model = svm.SVC(gamma=0.23, C=3.20)
results = cross_val_score(model, PCA_X_Final,y_Final, cv = kfold)
print(results)
print('Accuracy = {}%, Standard Deviation = {}%'.format(round(results.mean(), 4), round(results.std(), 2)))
I would like to know how I can re-scale and re-impute each fold, so that the variables are re-scaled and the NA values re-imputed in each fold using only that fold's train set, to avoid overfitting / data leakage.
To impute and scale the data with the parameters derived from each fold in the CV, you first need to establish the engineering steps in a pipeline, and then do CV over the entire pipeline. For example, something like this:
set up engineering pipeline:
my_pipe = Pipeline([
    # missing data imputation
    ('imputer_num',
     mdi.MeanMedianImputer(imputation_method='mean', variables=['varA', 'varB'])),
    # scaler
    ('scaler', StandardScaler()),
    # Gradient Boosted machine (or your SVM instead)
    ('gbm', GradientBoostingClassifier(random_state=0))
])
then the CV:
param_grid = {
    # try different gradient boosted tree model parameters
    'gbm__max_depth': [None, 1, 3],
}
# now we set up the grid search with cross-validation
grid_search = GridSearchCV(my_pipe, param_grid,
                           cv=5, n_jobs=-1, scoring='roc_auc')
More details in this notebook.
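For completeness, here is a minimal sketch (not part of the original answer) of how the same idea could plug into the asker's KFold / cross_val_score setup, assuming X and y are the raw, unimputed and unscaled features and target, and reusing the imputer and SVM settings from the question:
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, cross_val_score
from feature_engine import missing_data_imputers as mdi
svm_pipe = Pipeline([
    ('imputer_num', mdi.MeanMedianImputer(imputation_method='median')),
    ('scaler', StandardScaler()),
    ('svc', svm.SVC(gamma=0.23, C=3.20))
])
kfold = KFold(n_splits=10, shuffle=True, random_state=3)  # shuffle so random_state takes effect
# the imputer and scaler are now refit on the training portion of every fold
results = cross_val_score(svm_pipe, X, y, cv=kfold)
print(results.mean(), results.std())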

Fitting in nested cross-validation with cross_val_score with pipeline and GridSearch

I am working in scikit-learn and I am trying to tune my XGBoost.
I made an attempt at nested cross-validation, using a pipeline for the rescaling of the training folds (to avoid data leakage and overfitting), together with GridSearchCV for parameter tuning and cross_val_score to get the roc_auc score at the end.
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
std_scaling = StandardScaler()
algo = XGBClassifier()
steps = [('std_scaling', StandardScaler()), ('algo', XGBClassifier())]
pipeline = Pipeline(steps)
parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}
cv1 = RepeatedKFold(n_splits=2, n_repeats = 5, random_state = 15)
clf_auc = GridSearchCV(pipeline, cv = cv1, param_grid = parameters, scoring = 'roc_auc', n_jobs=-1, return_train_score=False)
cv1 = RepeatedKFold(n_splits=2, n_repeats = 5, random_state = 15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv = cv1, scoring = 'roc_auc')
Question 1.
How do I fit cross_val_score to the training data?
Question 2.
Since I included the StandardScaler() in the pipeline does it make sense to include the X_train in the cross_val_score or should I use a standardized form of the X_train (i.e. std_X_train)?
std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)
You chose the right way to avoid data leakage as you say - nested CV.
The thing is, in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existing "meta-estimator" which describes your model selection process as well.
Meaning - in every round of the outer cross validation (in your case represented by cross_val_score), the estimator clf_auc undergoes internal CV which selects the best model under the given fold of the external CV.
Therefore, for every fold of the external CV you are scoring a different estimator chosen by the internal CV.
For example, in one external CV fold the model scored can be one that selected the param algo__min_child_weight to be 1, and in another a model that selected it to be 2.
The score of the external CV therefore represents a more high-level score: "under the process of reasonable model selection, how well will my selected model generalize".
Now, if you want to finish the process with a real model in hand you would have to select it in some way (cross_val_score will not do that for you).
The way to do that is to now fit your internal model over the entire data.
meaning to perform:
clf_auc.fit(X, y)
This is the moment to understand what you've done here:
You have a model you can use, which is fitted over all the data available.
When you're asked "how well does that model generalizes on new data?" the answer is the score you got during your nested CV - which captured the model selection process as part of your model's scoring.
And regarding Question #2 - if the scaler is part of the pipeline, there is no reason to manipulate the X_train externally.
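Concretely, a brief sketch of what that workflow looks like with the clf_auc and cv1 defined above (X_train / y_train go in raw, since the StandardScaler sits inside the pipeline):
# outer CV: scores the whole tuning procedure, not one fixed model
outer_scores = cross_val_score(clf_auc, X_train, y_train, cv=cv1, scoring='roc_auc')
print(outer_scores.mean(), outer_scores.std())
# final model: rerun the inner grid search on all the training data
clf_auc.fit(X_train, y_train)
print(clf_auc.best_params_)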

Nested cross validation with StratifiedShuffleSplit in sklearn

I am working on a binary classification problem and would like to perform nested cross-validation to assess the classification error. The reason I'm doing nested CV is the small sample size (N_0 = 20, N_1 = 10), where N_0 and N_1 are the numbers of instances in the 0 and 1 classes respectively.
My code is quite simple:
>> pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', LogisticRegression(penalty='l1'))])
>> parameters = {'clf__C': logspace(-4,1,50)}
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=5)
>> cross_val_score(grid_search, X, y, cv=5)
So far, so good. If I want to change the CV scheme from random splitting to StratifiedShuffleSplit in both the outer and inner CV loops, I face a problem: how can I pass the class vector y, as it is required by the StratifiedShuffleSplit function?
Naively:
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=StratifiedShuffleSplit(y_inner_loop, 5, test_size=0.5, random_state=0))
>> cross_val_score(grid_search, X, y, cv=StratifiedShuffleSplit(y, 5, test_size=0.5, random_state=0))
So, the problem is how to specify y_inner_loop?
** My data set is slightly imbalanced (20/10) and I would like to keep this splitting ratio for training and assessing the model.
So far, I have resolved this problem, which might be of interest to some novices in ML. In the newest version of scikit-learn, 0.18, the cross-validation utilities have moved to the sklearn.model_selection module and their API has changed slightly. Making a long story short:
>> from sklearn.model_selection import StratifiedShuffleSplit
>> sss_outer = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)
>> sss_inner = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=16)
>> pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', LogisticRegression(penalty='l1'))])
>> parameters = {'clf__C': logspace(-4,1,50)}
>> grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='f1', cv=sss_inner)
>> cross_val_score(grid_search, X, y, cv=sss_outer)
UPD: in the newest version, we do not need to explicitly specify the target vector ("y", which was my problem initially), but rather only the number of desired splits.
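The reason this works is that the new splitter objects receive y when the splits are actually generated (GridSearchCV and cross_val_score pass it through for you), not when the splitter is constructed. A small illustration, assuming the X and y from above:
# the stratification target is supplied at split time, not at construction time
for train_idx, test_idx in sss_outer.split(X, y):
    print(len(train_idx), len(test_idx))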

feature selection using logistic regression

I am performing feature selection (on a dataset with 1,930,388 rows and 88 features) using Logistic Regression. If I test the model on held-out data, the accuracy is just above 60%. The response variable is equally distributed. My question is: if the model's performance is not good, can I consider the features that it gives as actually important features? Or should I try to improve the accuracy of the model, even though my end goal is not to improve the accuracy but only to get important features?
sklearn's GridSearchCV has some pretty neat methods to give you the best feature set. For example, consider the following code:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10, 20, 30)
}
Here the parameters dict holds all of the different parameters that I need to consider. Notice the use of vect__max_df. max_df is an actual key that is used by my vectorizer, which is my feature selector. So,
'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
actually specifies that I want to try out the above 5 values for my vectorizer. Similarly for the others. Notice how I have tied my vectorizer to the key 'vect' and my classifier to the key 'clf'. Can you see the pattern? Moving on:
import re
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
traindf = pd.read_json('../../data/train.json')
traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]
traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]
X, y = traindf['ingredients_string'], traindf['cuisine'].as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)
print ('best score: %0.3f' % grid_search.best_score_)
print ('best parameters set:')
bestParameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
print ('\t %s: %r' % (param_name, bestParameters[param_name]))
predictions = grid_search.predict(X_test)
print ('Accuracy:', accuracy_score(y_test, predictions))
print ('Confusion Matrix:', confusion_matrix(y_test, predictions))
print ('Classification Report:', classification_report(y_test, predictions))
Note that bestParameters will give me the best set of parameters out of all the options that I specified while creating my pipeline.
Hope this helps.
Edit: To get a list of features selected
So once you have your best set of parameters, create vectorizers and classifiers with those parameter values:
vect = TfidfVectorizer()  # use the best parameters found above
Then you basically train this vectorizer again. In doing so, the vectorizer will choose certain features from your training set.
traindf = pd.read_json('../../data/train.json')
traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]
traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]
X, y = traindf['ingredients_string'], traindf['cuisine'].as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
termDocMatrix = vect.fit_transform(X_train, y_train)
Now termDocMatrix has all of the selected features. Also, you can use the vectorizer to get the feature names. Let's say you want to get the top 100 features, and your metric for comparison is the chi-square score:
from sklearn.feature_selection import SelectKBest, chi2
getKbest = SelectKBest(chi2, k=100)
getKbest.fit(termDocMatrix, y_train)  # fit the selector so get_support() is available
now just
print(np.asarray(vect.get_feature_names())[getKbest.get_support()])
should give you the top 100 features. Try this.
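As a follow-up sketch (my own variation, not part of the original answer): the SelectKBest step can also live inside the pipeline, so GridSearchCV tunes the number of kept features along with the other parameters, and the selected feature names come straight out of the fitted best estimator. The k values below are placeholders.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('select', SelectKBest(chi2)),
    ('clf', LogisticRegression())
])
parameters = {
    'select__k': (100, 500, 1000),  # placeholder values
    'clf__C': (0.1, 1, 10)
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)
best = grid_search.best_estimator_
mask = best.named_steps['select'].get_support()
# use get_feature_names_out() on newer scikit-learn versions
print(np.asarray(best.named_steps['vect'].get_feature_names())[mask])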
