SKLearn Error with Pipeline and Gridsearch - scikit-learn

I would like to first split my data into a test and train set. Then I want to use GridSearchCV on my training set (internally split into train/validation sets). In the end I want to collect all the test data and do some other things (not in the scope of this question).
I have to scale my data, so I want to handle this problem in a pipeline. Some things in my SVC should be fixed (kernel='rbf', class_weight=...).
When I run the code, the following error occurs:
"ValueError: Invalid parameter estimator for estimator Pipeline"
I don't understand what I'm doing wrong. I tried to follow this thread: StandardScaler with Pipelines and GridSearchCV
The only difference is that I fix some parameters in my SVC. How can I handle this?
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

target = np.array(target).ravel()
loo = LeaveOneOut()
loo.get_n_splits(input)

# Outer Loop
for train_index, test_index in loo.split(input):
    X_train, X_test = input[train_index], input[test_index]
    y_train, y_test = target[train_index], target[test_index]

    p_grid = {'estimator__C': np.logspace(-5, 2, 20),
              'estimator__gamma': np.logspace(-5, 3, 20)}

    SVC_Kernel = SVC(kernel='rbf', class_weight='balanced', tol=10e-4,
                     max_iter=200000, probability=False)
    pipe_SVC = Pipeline([('scaler', RobustScaler()), ('SVC', SVC_Kernel)])

    n_splits = 5
    scoring = "f1_micro"
    inner_cv = StratifiedKFold(n_splits=n_splits,
                               shuffle=True, random_state=5)

    clfSearch = GridSearchCV(estimator=pipe_SVC, param_grid=p_grid,
                             cv=inner_cv, scoring='f1_micro', iid=False, n_jobs=-1)
    clfSearch.fit(X_train, y_train)

    print("Best parameters set found on validation set for Support Vector Machine:")
    print()
    print(clfSearch.best_params_)
    print()
    print(clfSearch.best_score_)
    print("Grid scores on validation set:")
    print()
I also tried it this way:
p_grid = {'estimator__C': np.logspace(-5, 2, 20),
          'estimator__gamma': np.logspace(-5, 3, 20),
          'estimator__tol': [10e-4],
          'estimator__kernel': ['rbf'],
          'estimator__class_weight': ['balanced'],
          'estimator__max_iter': [200000],
          'estimator__probability': [False]}
SVC_Kernel = SVC()
This also doesn't work.

The problem is in your p_grid. You are grid searching over your Pipeline, and that doesn't have anything called estimator. It does have a step called SVC, so if you want to set that SVC's parameters, you should prefix your keys with SVC__ instead of estimator__. So replace p_grid with:
p_grid = {'SVC__C': np.logspace(-5, 2, 20),
          'SVC__gamma': np.logspace(-5, 3, 20)}
Also, you can replace your outer for loop with the cross_validate function.
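A minimal sketch of that, assuming the corrected p_grid above and the pipe_SVC, inner_cv, input and target already defined in the question (nested CV: the grid search is the inner loop, cross_validate with LeaveOneOut is the outer loop):
from sklearn.model_selection import cross_validate, GridSearchCV, LeaveOneOut

# inner loop: grid search over the corrected p_grid on the pipeline
clfSearch = GridSearchCV(estimator=pipe_SVC, param_grid=p_grid,
                         cv=inner_cv, scoring='f1_micro', n_jobs=-1)
# outer loop: LeaveOneOut evaluation of the tuned model
outer = cross_validate(clfSearch, input, target, cv=LeaveOneOut(),
                       scoring='f1_micro', n_jobs=-1)
print(outer['test_score'].mean())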

Related

Pipeline for more than 2 classifiers

I am trying to build an ensemble using KNN and random forest classifiers.
steps = [('scaler', StandardScaler()),
         ('regressor', VotingClassifier(estimators=[
             ('knn', KNeighborsClassifier()),
             ('clf', RandomForestClassifier())]))]
pipeline = Pipeline(steps)

parameters = [{'knn__n_neighbors': np.arange(1, 50)},
              {'clf__n_estimators': [10, 20, 30],
               'clf__criterion': ['gini', 'entropy'],
               'clf__max_features': [5, 10, 15],
               'clf__max_depth': ['auto', 'log2', 'sqrt', None]}]

X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(),
                                                    test_size=0.3, random_state=65)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
I have encountered the following error while running the above code:
Invalid parameter knn for estimator Pipeline(steps=[('scaler', StandardScaler()),
('regressor',VotingClassifier(estimators=[('knn', KNeighborsClassifier()),('clf', RandomForestClassifier())]))]). Check the list of available parameters with estimator.get_params().keys()
Since I am new to machine learning, I am having difficulty understanding the error.
I agree that the error message is not clear, but the error is raised because of the knn estimator in your VotingClassifier. Please check the VotingClassifier documentation:
voting : str, {'hard', 'soft'}, default='hard'
If 'hard', uses predicted class labels for majority rule voting. Else if 'soft', predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
(...)
attributes : dict, default=None
Dictionary mapping each estimator to a list of attributes to be extracted as predictors.
If None, all public estimator attributes are used.
Concretely, KNeighborsClassifier has no predict_proba method, thus it cannot work with the soft voting classifier.
If you want to keep both estimators in your VotingClassifier, you should set voting='hard':
VotingClassifier(estimators=[('knn', KNeighborsClassifier()), ('clf', RandomForestClassifier())],voting='hard')
Let me know if it helped.
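As a side note, the error message itself points at a quick way to list the parameter names the grid search will accept; a minimal sketch against the pipeline defined in the question:
# every tunable parameter of the pipeline, including nested ones such as
# 'regressor__knn__n_neighbors' and 'regressor__clf__n_estimators'
for name in sorted(pipeline.get_params().keys()):
    print(name)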

how to use an explicit validation set with predefined split fold?

I have explicit train, test and validation sets as 2d arrays:
X_train.shape
(1400, 38785)
X_val.shape
(200, 38785)
X_test.shape
(400, 38785)
I am tuning the alpha parameter and need advice about how I can use the predefined validation set in it:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, PredefinedSplit
nb = MultinomialNB()
nb.fit(X_train, y_train)
params = {'alpha': [0.1, 1, 3, 5, 10,12,14]}
# how to use on my validation set?
# ps = PredefinedSplit(test_fold=?)
gs = GridSearchCV(nb, param_grid=params, cv = ps, return_train_score=True, scoring='f1')
gs.fit(X_train, y_train)
My results so far are as follows.
# on my validation set, alpha = 5
gs.fit(X_val, y_val)
print('Grid best parameter', gs.best_params_)
Grid best parameter: {'alpha': 5}
# on my training set, alpha = 10
Grid best parameter: {'alpha': 10}
I have read the following questions and documentation, yet I am not sure how to use PredefinedSplit() in my case. Thank you.
Order between using validation, training and test sets
https://scikit-learn.org/stable/modules/cross_validation.html#predefined-fold-splits-validation-sets
You can achieve your desired outcome by merging X_train and X_val and passing PredefinedSplit a list of labels, with -1 indicating training data and 1 indicating validation data, i.e.:
X = np.concatenate((X_train, X_val))
y = np.concatenate((y_train, y_val))
ps = PredefinedSplit(test_fold=np.concatenate((-np.ones(len(X_train)), np.ones(len(X_val)))))
gs = GridSearchCV(nb, param_grid=params, cv = ps, return_train_score=True, scoring='f1')
gs.fit(X, y) # not X_train, y_train
However, unless there is a very good reason for holding out a separate validation set, you will likely have less overfitting if you use k-fold cross-validation for your hyperparameter tuning rather than a dedicated validation set.
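If you do go the k-fold route, a minimal sketch reusing nb, params and the merged X, y from above:
from sklearn.model_selection import GridSearchCV

# plain 5-fold CV on the merged data instead of a single fixed validation split
gs = GridSearchCV(nb, param_grid=params, cv=5,
                  return_train_score=True, scoring='f1')
gs.fit(X, y)
print(gs.best_params_)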

Grid Search CV invalid parameter error for decision tree

I have been trying to do grid search CV for a decision tree. I am getting the following error.
ValueError: Invalid parameter min_samples_split for estimator RecursiveTabularRegressionForecaster(estimator=DecisionTreeRegressor()). Check the list of available parameters with estimator.get_params().keys().
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sktime.forecasting.model_selection import ForecastingRandomizedSearchCV
from sktime.forecasting.model_selection import SlidingWindowSplitter
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.compose import make_reduction

y = univ['GHI']
y_train, y_test = temporal_train_test_split(y, test_size=30)

regressor = DecisionTreeRegressor()
forecaster = make_reduction(regressor)

params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4]}
cv = SlidingWindowSplitter(initial_window=60, window_length=30)

nrcv = ForecastingRandomizedSearchCV(forecaster, strategy="refit", cv=cv,
                                     param_distributions=params,
                                     n_iter=5, random_state=42)
nrcv.fit(y_train)
y_pred = nrcv.predict(np.arange(1, y_test.size + 1))
print(nrcv.best_params_)
print(nrcv.best_score_)
I have also tried the estimator.get_params().keys() function, and I think it matches the parameters:
dict_keys(['ccp_alpha', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'random_state', 'splitter'])
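Those keys belong to the bare DecisionTreeRegressor, but the search runs on the forecaster returned by make_reduction, which (as the error message shows) wraps the tree in its estimator parameter. A sketch of what that likely implies; treat the exact names as an assumption and confirm them against forecaster.get_params().keys():
# likely fix (sketch): prefix the tree's parameters with 'estimator__', because
# ForecastingRandomizedSearchCV tunes the forecaster, not the regressor directly
params = {'estimator__max_leaf_nodes': list(range(2, 100)),
          'estimator__min_samples_split': [2, 3, 4]}

# check the parameter names exposed by the object actually being searched
print(forecaster.get_params().keys())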

Kfold cross validation in python

What I'm trying to do:
Get the k-fold cross-validated scores of an SVM. The data has all numerical independent variables and a categorical dependent variable. I'm using Python 3, sklearn and feature_engine.
My understanding of the matter:
The independent variables have NA values, all of them below 5% of the total data points, so I imputed them using the median values from the train set, as the variables are not normally distributed. I also scaled the values of the train and test set using the values from the train set. My train-test split is 80-20.
I understand that it is good practice to scale and impute data using only the train set, as this helps avoid overfitting and data leakage.
When it comes to k-fold cross validation, the train and test set change.
Question:
Is there a way to ensure that I can re-impute and re-scale the train and test set based on the train set of each fold?
Any help is appreciated, thank you!
Train-test split using a random seed. The same random seed is used in the k-fold cross validation.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 3)
NA value imputation:
from feature_engine import missing_data_imputers as mdi
imputer = mdi.MeanMedianImputer(imputation_method = 'median')
imputer.fit(X_train)
X_train = imputer.transform(X_train)
Variable transformation:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)
Below is the SVM:
def svm1(gam, C):
    clf1 = svm.SVC(gamma=gam, C=C)
    clf1.fit(X_train_trans, y_train)
    print('The Trainset Score is {}.'.format(clf1.score(X_train_trans, y_train)))
    print('The Testset Score is {}.'.format(clf1.score(X_test_trans, y_test)))
    print('')
    y_pred1 = clf1.predict(X_test_trans)
    print('The confusion matrix is: \n{}'.format(metrics.confusion_matrix(y_test, y_pred1)))

interactive(svm1, gam=G1, C=cc1)
I then merge the train and test set to get back a transformed dataset:
frames3 = [X_test_trans, X_train_trans ]
X_Final = pd.concat(frames3)
Now I fit X_Final, the concatenated train and test set, to get the k-fold cross-validated score.
kfold = KFold(n_splits = 10, random_state = 3)
model = svm.SVC(gamma=0.23, C=3.20)
results = cross_val_score(model, PCA_X_Final,y_Final, cv = kfold)
print(results)
print('Accuracy = {}%, Standard Deviation = {}%'.format(round(results.mean(), 4), round(results.std(), 2)))
I would like to know how I can re-scale and re-impute each fold, so that the variables are re-scaled and NA values re-imputed using that fold's train set, to avoid overfitting / data leakage.
To impute and scale the data with parameters derived from each fold of the CV, you first need to put the engineering steps in a pipeline, and then run the CV over the entire pipeline. For example, something like this:
Set up the engineering pipeline:
my_pipe = Pipeline([
    # missing data imputation
    ('imputer_num',
     mdi.MeanMedianImputer(imputation_method='mean', variables=['varA', 'varB'])),
    # scaler
    ('scaler', StandardScaler()),
    # gradient boosted machine (or your SVM instead)
    ('gbm', GradientBoostingClassifier(random_state=0))
])
Then the CV:
param_grid = {
    # try different gradient boosted tree model parameters
    'gbm__max_depth': [None, 1, 3],
}
# now we set up the grid search with cross-validation
grid_search = GridSearchCV(my_pipe, param_grid,
                           cv=5, n_jobs=-1, scoring='roc_auc')
More details in this notebook.
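Since the original goal was plain k-fold scores rather than a grid search, here is a minimal sketch along the same lines, swapping in the SVM from the question; the step names are arbitrary, and X, y are assumed to be the raw (un-imputed, unscaled) data:
from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from feature_engine import missing_data_imputers as mdi  # feature_engine version used in the question

pipe = Pipeline([
    ('imputer', mdi.MeanMedianImputer(imputation_method='median')),
    ('scaler', StandardScaler()),
    ('svc', svm.SVC(gamma=0.23, C=3.20)),
])

# every fold refits the imputer and scaler on its own training split only,
# so nothing leaks from the held-out part of the fold
kfold = KFold(n_splits=10, shuffle=True, random_state=3)
results = cross_val_score(pipe, X, y, cv=kfold)
print(results.mean(), results.std())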

feature selection using logistic regression

I am performing feature selection (on a dataset with 1,930,388 rows and 88 features) using logistic regression. If I test the model on held-out data, the accuracy is just above 60%. The response variable is equally distributed. My question is: if the model's performance is not good, can I consider the features that it gives as actually important features? Or should I try to improve the accuracy of the model, even though my end goal is not to improve accuracy but only to get the important features?
sklearn's GridSearchCV has some pretty neat methods to give you the best feature set. For example, consider the following code:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10, 20, 30)
}
Here the parameters dict holds all of the different parameters that I need to consider. Notice the use of vect__max_df. max_df is an actual key used by my vectorizer, which is my feature selector. So,
'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
actually specifies that I want to try out the above 5 values for my vectorizer. Similarly for the others. Notice how I have tied my vectorizer to the key 'vect' and my classifier to the key 'clf'. Can you see the pattern? Moving on,
traindf = pd.read_json('../../data/train.json')
traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]
traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]
X, y = traindf['ingredients_string'], traindf['cuisine'].as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)
print ('best score: %0.3f' % grid_search.best_score_)
print ('best parameters set:')
bestParameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t %s: %r' % (param_name, bestParameters[param_name]))
predictions = grid_search.predict(X_test)
print ('Accuracy:', accuracy_score(y_test, predictions))
print ('Confusion Matrix:', confusion_matrix(y_test, predictions))
print ('Classification Report:', classification_report(y_test, predictions))
Note that the bestParameters dictionary will give me the best set of parameters out of all the options I specified while creating my pipeline.
Hope this helps.
Edit: To get a list of features selected
So once you have your best set of parameters, create the vectorizer and classifier with those parameter values:
vect = TfidfVectorizer('''use the best parameters here''')
Then you basically train this vectorizer again. In doing so, the vectorizer will choose certain features from your training set.
traindf = pd.read_json('../../data/train.json')
traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]
traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]
X, y = traindf['ingredients_string'], traindf['cuisine'].as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
termDocMatrix = vect.fit_transform(X_train, y_train)
Now termDocMatrix holds all of the selected features, and you can use the vectorizer to get the feature names. Let's say you want the top 100 features, and your metric for comparison is the chi-square score:
getKbest = SelectKBest(chi2, k=100)
getKbest.fit(termDocMatrix, y_train)
Now just
print(np.asarray(vect.get_feature_names())[getKbest.get_support()])
should give you the top 100 features. Try this.