I'm playing with the RandomizedSearchCV function from scikit-learn. Some academic papers claim that randomized search can provide 'good enough' results compared with a full grid search while saving a lot of time.
Surprisingly, on one occasion RandomizedSearchCV gave me better results than GridSearchCV. I thought GridSearchCV is supposed to be exhaustive, so its result should be at least as good as that of RandomizedSearchCV, provided both search the same grid.
For the same dataset and mostly the same settings, GridSearchCV returned the following result:
Best cv accuracy: 0.7642857142857142
Test set score: 0.725
Best parameters: 'C': 0.02
RandomizedSearchCV returned the following result:
Best cv accuracy: 0.7428571428571429
Test set score: 0.7333333333333333
Best parameters: 'C': 0.008
To me the test score of 0.733 is better than 0.725, and the difference between the test score and the CV score is smaller for RandomizedSearchCV, which to my knowledge means less overfitting.
So why did GridSearchCV give me worse results?
GridSearchCV code:
def linear_SVC(x, y, param, kfold):
    param_grid = {'C':param}
    k = KFold(n_splits=kfold, shuffle=True, random_state=0)
    grid = GridSearchCV(LinearSVC(), param_grid=param_grid, cv=k, n_jobs=4, verbose=1)
    return grid.fit(x, y)
#high C means more chance of overfitting
start = timer()
param = [i/1000 for i in range(1,1000)]
param1 = [i for i in range(1,101)]
param.extend(param1)
#progress = progressbar.bar.ProgressBar()
clf = linear_SVC(x=x_train, y=y_train, param=param, kfold=3)
print('LinearSVC:')
print('Best cv accuracy: {}' .format(clf.best_score_))
print('Test set score: {}' .format(clf.score(x_test, y_test)))
print('Best parameters: {}' .format(clf.best_params_))
print()
duration = timer() - start
print('time to run: {}' .format(duration))
RandomizedSearchCV code:
from sklearn.model_selection import RandomizedSearchCV
def Linear_SVC_Rand(x, y, param, kfold, n):
    param_grid = {'C':param}
    k = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=0)
    randsearch = RandomizedSearchCV(LinearSVC(), param_distributions=param_grid, cv=k, n_jobs=4,
                                    verbose=1, n_iter=n)
    return randsearch.fit(x, y)
start = timer()
param = [i/1000 for i in range(1,1000)]
param1 = [i for i in range(1,101)]
param.extend(param1)
#progress = progressbar.bar.ProgressBar()
clf = Linear_SVC_Rand(x=x_train, y=y_train, param=param, kfold=3, n=100)
print('LinearSVC:')
print('Best cv accuracy: {}' .format(clf.best_score_))
print('Test set score: {}' .format(clf.score(x_test, y_test)))
print('Best parameters: {}' .format(clf.best_params_))
print()
duration = timer() - start
print('time to run: {}' .format(duration))
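First, try to understand this:
https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation
For classification, StratifiedKFold is usually a better choice than KFold because each fold keeps roughly the class proportions of the full dataset. Note that your GridSearchCV code uses KFold while your RandomizedSearchCV code uses StratifiedKFold, so the two searches were scored on different folds and their CV scores are not directly comparable.
Use StratifiedKFold in both GridSearchCV and RandomizedSearchCV. Set shuffle=False (the default) and do not pass random_state. That way the dataset is not shuffled, so the folds, and therefore your results, do not change each time you train. You might then get what you expect.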
GridSearchCV code:
def linear_SVC(x, y, param, kfold):
    param_grid = {'C':param}
    k = StratifiedKFold(n_splits=kfold)
    grid = GridSearchCV(LinearSVC(), param_grid=param_grid, cv=k, n_jobs=4, verbose=1)
    return grid.fit(x, y)
RandomizedSearchCV code:
def Linear_SVC_Rand(x, y, param, kfold, n):
    param_grid = {'C':param}
    k = StratifiedKFold(n_splits=kfold)
    randsearch = RandomizedSearchCV(LinearSVC(), param_distributions=param_grid, cv=k, n_jobs=4,
                                    verbose=1, n_iter=n)
    return randsearch.fit(x, y)
From what it looks like, it is performing properly. Your CV accuracy in the grid search is higher than the CV accuracy in the randomized search. Hyperparameters should not be tuned using the test set, so assuming you're doing that properly, it might just be a coincidence that the hyperparameters chosen by the randomized search performed better on the test set.
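The point about CV scores can be sketched without scikit-learn. The snippet below is only an illustration: `cv_score` is a hypothetical stand-in for the mean cross-validation score at a given C, not the real LinearSVC result. It shows that, when both searches are scored on identical folds, the best CV score found in a random subset of the grid cannot exceed the best CV score over the whole grid. It says nothing about the test score, which can go either way.

```python
import random


def cv_score(c):
    """Hypothetical stand-in for the mean CV accuracy at a given C."""
    return 0.75 - (c - 0.02) ** 2


# Same candidate grid as in the question: 0.001..0.999 plus 1..100.
grid = [i / 1000 for i in range(1, 1000)] + list(range(1, 101))

# Exhaustive search evaluates every candidate.
grid_best = max(grid, key=cv_score)

# Randomized search evaluates only n_iter candidates drawn from the same grid.
rng = random.Random(0)
sampled = rng.sample(grid, 100)
random_best = max(sampled, key=cv_score)

print("grid search best C:  ", grid_best, cv_score(grid_best))
print("random search best C:", random_best, cv_score(random_best))

# On identical folds, the subset's best CV score is bounded by the grid's best.
assert cv_score(random_best) <= cv_score(grid_best)
```

If the two searches use different splitters, as in the question, this bound no longer has to hold, because the same C gets a different CV score in each search.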
Problem with predict function of Functional Model
I am trying to combine nested cross-validation and a pipeline with my Functional model.
This is the code:
# binaryModel builds a Functional API ANN (definition not shown)
grid = dict(ann__n_neurons=[2], ann__num_hidden=[2], ann__used_optimizer=["adam"],
ann__l1_reg=[0.0], ann__l2_reg=[0.0], ann__learning_rate=[0.01],
ann__dropout_rate=[0.0])
X, y = prepare_dataset("", short, bin_categorical, "",
continous_to_binary, target)
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1) # outer cross-validation, 10 folds, to test the model
# enumerate splits
outer_results = list()
i=0
for train_ix, test_ix in cv_outer.split(X):
    print("Outer-Split: ",i)
    i+=1
    # split data
    X_train, X_test = X.iloc[train_ix], X.iloc[test_ix]
    y_train, y_test = y[train_ix], y[test_ix]
    # configure the cross-validation procedure
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1) # inner cross-validation, 3 folds, to configure the model
    # define the model
    ann = KerasClassifier(build_fn=binaryModel, input_shape=X_train.shape[1],
                          batch_size=32,
                          epochs=10, validation_split=0.2)
    # define the pipeline
    pipe = Pipeline(steps=[('scaler', StandardScaler()), ('ann', ann)])
    # define the grid search
    cv = GridSearchCV(
        pipe, grid, n_jobs=1, cv=cv_inner,refit=True)
    # execute search
    cv.fit(X_train, y_train, ann__verbose=0)
    print('Best score and parameter combination = ')
    print(cv.best_score_)
    print(cv.best_params_)
    print(cv.best_estimator_)
    y_predicted = cv.predict(X_test)
Output:
Best score and parameter combination =
0.8449265360832214
{'ann__dropout_rate': 0.0, 'ann__l1_reg': 0.0, 'ann__l2_reg': 0.0, 'ann__learning_rate': 0.01, 'ann__n_neurons': 2, 'ann__num_hidden': 2, 'ann__used_optimizer': 'adam'}
Pipeline(steps=[('scaler', StandardScaler()),
('ann',
<tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier object at 0x7efef01ffd30>)])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-28-3f1c5b78794d> in <module>
61 print(cv.best_params_)
62 print(cv.best_estimator_[1])
---> 63 y_predicted = cv.predict(X_test)
AttributeError: 'Functional' object has no attribute 'predict_classes'
How can I make predictions with the final best model?
My result is a Pipeline, but I can't use its predict function. Why?
I want to use the predict function to evaluate each fold (accuracy, sensitivity and so on).
The problem is that a Keras Functional API model doesn't have a 'predict_classes' method; only Sequential Keras models have it. The KerasClassifier wrapper calls that method when GridSearchCV asks the pipeline to predict. I have been running into the same problem. I would suggest implementing your own grid search, or trying https://github.com/autonomio/talos, which seems promising, though I have not tried it myself yet.
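A possible workaround, sketched here under assumptions rather than taken from the answer: a Functional model still has a predict method that returns probabilities, so class labels can be derived from those by hand. The helper below is hypothetical and works on plain nested lists. It follows the same rule predict_classes used: argmax when there are several output units, a 0.5 threshold when there is a single sigmoid unit.

```python
def probabilities_to_classes(probabilities, threshold=0.5):
    """Turn per-sample probability rows into integer class labels.

    Hypothetical helper: `probabilities` is a list of rows, e.g. the
    output of model.predict(X) converted with .tolist().
    """
    labels = []
    for row in probabilities:
        if len(row) == 1:
            # Single sigmoid output: compare against the threshold.
            labels.append(int(row[0] > threshold))
        else:
            # Several outputs (softmax): take the index of the largest value.
            labels.append(max(range(len(row)), key=row.__getitem__))
    return labels


print(probabilities_to_classes([[0.2], [0.9]]))
print(probabilities_to_classes([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]))
```

With the labels in hand, accuracy or sensitivity for each outer fold can be computed with the usual sklearn.metrics functions.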
I am trying to solve a problem that resembles Fisher's iris classification. I can train the model on my computer, but the trained model has to predict class membership on a computer where it is impossible to install Python and scikit-learn. I want to understand how, given the coefficients of the logistic regression model, I can predict class membership without using the model's predict method.
Using the Fisher iris problem as an example, I do the following.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score
# data preparation
iris = load_iris()
data = pd.DataFrame(data=np.hstack([iris.data, iris.target[:, np.newaxis]]),
columns=iris.feature_names + ['target'])
names = data.columns
# split data
X_train, X_test, y_train, y_test = train_test_split(data[names[:-1]], data[names[-1]], random_state=42)
# train model
cls = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=2, random_state=42)
)
cls = cls.fit(X_train.to_numpy(), y_train)
preds_train = cls.predict(X_train)
# prediction
preds_test = cls.predict(X_test)
# scores
train_score = accuracy_score(preds_train, y_train), f1_score(preds_train, y_train, average='macro') # on train data
# train_score = (0.9642857142857143, 0.9653621232568601)
test_score = accuracy_score(preds_test, y_test), f1_score(preds_test, y_test, average='macro') # on test data
# test_score = (1.0, 1.0)
# model coefficients
cls[1].coef_, cls[1].intercept_
>>> (array([[-1.13948079, 1.30623841, -2.21496793, -2.05617771],
[ 0.66515676, -0.2541143 , -0.55819748, -0.86441227],
[ 0.47432404, -1.05212411, 2.77316541, 2.92058998]]),
array([-0.35860337, 2.43929019, -2.08068682]))
Now I have the coefficients of the model. And I want to use them to make predictions.
First, I compute class probabilities with the predict_proba method and look at the first five observations of the test sample.
preds_test = cls.predict_proba(X_test)
preds_test[0:5]
>>>array([[5.66019001e-03, 9.18455687e-01, 7.58841233e-02],
[9.75854479e-01, 2.41455095e-02, 1.10881450e-08],
[1.18780156e-09, 6.53295166e-04, 9.99346704e-01],
[6.71574900e-03, 8.14174200e-01, 1.79110051e-01],
[6.98756622e-04, 8.09096425e-01, 1.90204818e-01]])
Then I manually calculate the predictions of the class probabilities for the observations using the coefficients of the model.
# define two functions for making predictions
def logit(x, w):
    return np.dot(x, w)
# from here: https://stackoverflow.com/questions/34968722/how-to-implement-the-softmax-function-in-python
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis] # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis] # ditto
    return e_x / div
n, k = X_test.shape
X_ = np.hstack((np.ones((n, 1)), X_test)) # add column with 1 for intercept
weights = np.hstack((cls[1].intercept_[:, np.newaxis], cls[1].coef_)) # create weights matrix
results = softmax(logit(X_, weights.T)) # calculate probabilities
results[0:5]
>>>array([[3.67343725e-14, 4.63938438e-06, 9.99995361e-01],
[2.81976786e-05, 8.63083152e-01, 1.36888650e-01],
[1.24572182e-22, 5.47800683e-11, 1.00000000e+00],
[3.32990060e-14, 3.08352323e-06, 9.99996916e-01],
[2.66415118e-15, 1.78252465e-06, 9.99998217e-01]])
If you compare the two results (preds_test[0:5] and results[0:5]), you can see that they do not match at all. Please explain what I am doing wrong and how I can use the model's coefficients to calculate predictions without using the predict method.
I forgot that a scaler was applied. If you change the code a little, then the results are the same.
scaler = StandardScaler()
scaler.fit(X_train)
X_test_transf = scaler.transform(X_test)
def logit(x, w):
    return np.dot(x, w)
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis] # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis] # ditto
    return e_x / div
n, k = X_test_transf.shape
X_ = np.hstack((np.ones((n, 1)), X_test_transf))
weights = np.hstack((cls[1].intercept_[:, np.newaxis], cls[1].coef_))
results = softmax(logit(X_, weights.T))
np.allclose(preds_test, results)
>>>True
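Since the target machine cannot run Python with scikit-learn, the same computation can be written with nothing but arithmetic, which ports easily to another language. The sketch below uses only the standard library. The coefficients and intercepts are copied from the output in the question; the scaler statistics are placeholders (an assumption), and in practice they would be the fitted scaler's mean_ and scale_ values exported together with the coefficients.

```python
import math


def standardize(x, mean, scale):
    """Apply the StandardScaler transform: (x - mean) / scale per feature."""
    return [(v - m) / s for v, m, s in zip(x, mean, scale)]


def softmax(z):
    """Numerically stable softmax over one list of logits."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x, mean, scale, coef, intercept):
    """Class probabilities for one observation from exported parameters."""
    x_std = standardize(x, mean, scale)
    logits = [b + sum(w * v for w, v in zip(row, x_std))
              for row, b in zip(coef, intercept)]
    return softmax(logits)


# Coefficients and intercepts as printed in the question.
coef = [[-1.13948079, 1.30623841, -2.21496793, -2.05617771],
        [0.66515676, -0.2541143, -0.55819748, -0.86441227],
        [0.47432404, -1.05212411, 2.77316541, 2.92058998]]
intercept = [-0.35860337, 2.43929019, -2.08068682]

# Placeholder scaler statistics (hypothetical): replace with the fitted
# scaler's mean_ and scale_ arrays.
mean = [0.0, 0.0, 0.0, 0.0]
scale = [1.0, 1.0, 1.0, 1.0]

probs = predict_proba([0.0, 0.0, 0.0, 0.0], mean, scale, coef, intercept)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted class is simply the index of the largest probability, so the softmax step can be skipped on the target machine if only the label is needed.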
For a binary classifier, predict_proba returns two values per sample. The first value is the probability of the event not occurring and the second is the probability of the event occurring. Use predict_proba(X)[:,1] to get the probability of the event occurring.
I found and successfully tested the following script, which applies Pipeline and GridSearchCV to classifier selection. The script outputs the best classifier and its accuracy.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10] # Augmenting test data
y_test = iris.target[:10] # Augmenting test data
#Create a pipeline
pipe = Pipeline([('classifier', LogisticRegression())])
# Create space of candidate learning algorithms and their hyperparameters
search_space = [{'classifier': [LogisticRegression()],
                 'classifier__penalty': ['l1', 'l2'],
                 'classifier__C': np.logspace(0, 4, 10)},
                {'classifier': [RandomForestClassifier()],
                 'classifier__n_estimators': [10, 100, 1000],
                 'classifier__max_features': [1, 2, 3]}]
# Create grid search
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0)
# Fit grid search
best_model = clf.fit(X_train, y_train)
print('Best training accuracy: %.3f' % best_model.best_score_)
print('Best estimator:', best_model.best_estimator_.get_params()['classifier'])
# Predict on test data with best params
y_pred = best_model.predict(X_test)
# Test data accuracy of model with best params
print(classification_report(y_test, y_pred, digits=4))
print('Test set accuracy score for best params: %.3f' % accuracy_score(y_test, y_pred))
from sklearn.metrics import precision_recall_fscore_support
print(precision_recall_fscore_support(y_test, y_pred,
                                      average='weighted'))
How can I adjust the script so that it outputs not only the overall best classifier, which is LogReg in our example, but also the best configuration of each of the other classifiers? In the example above, I would like to see the output for RandomForestClassifier() too.
Ideally, the solution shows the best classifier for each algorithm (LogReg, RandomForest, ...) and collects those best classifiers in a table. The first column or index should be the model, with the precision_recall_fscore_support values in the columns to its right. The table should then be sorted by F-score.
PS: Though the script works, I'm still unsure what the LogisticRegression() in the Pipeline does, since the classifier is defined again later in the search space.
Solution (simplified):
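from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV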
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10]
y_test = iris.target[:10]
seed=1
models = [
    'RFC',
    'logisticRegression'
]
clfs = [
    RandomForestClassifier(random_state=seed,n_jobs=-1),
    LogisticRegression()
]
params = {
    models[0]:{'n_estimators':[100]},
    models[1]: {'C':[1000]}
}
# run one grid search per classifier and report each one separately
for name, estimator in zip(models,clfs):
    print(name)
    clf = GridSearchCV(estimator, params[name], scoring='accuracy', refit=True, n_jobs=-1, cv=5)
    clf.fit(X_train, y_train)
    print("best params: " + str(clf.best_params_))
    print("best scores: " + str(clf.best_score_))
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print("Accuracy: {:.4%}".format(acc))
    print(classification_report(y_test, y_pred, digits=4))
If I understood correctly, this should work fine.
import pandas as pd
import numpy as np
df = pd.DataFrame(list(best_model.cv_results_['params']))
ranking = best_model.cv_results_['rank_test_score']
# The sorting is done based on the test_score of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])
# Sort the lines based on the ranking of the models
df_final = df.iloc[sorting]
# The first line contains the best model and its parameters
df_final.to_csv('sorted_table.csv')
# Or, to omit the index when writing
df_final.to_csv('sorted_table2.csv',index=False)
However, in this case the ordering is not based on the F-score. To sort by F-score, set the scoring parameter of GridSearchCV to 'f1_weighted' and repeat my code.
Example:
...
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0,scoring='f1_weighted')
best_model = clf.fit(X_train, y_train)
df = pd.DataFrame(list(best_model.cv_results_['params']))
ranking = best_model.cv_results_['rank_test_score']
# The sorting is done based on the F values of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])
# Sort the lines based on the ranking of the models
df_final = df.iloc[sorting]
df_final.to_csv('F_sorted_table.csv')
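To get the best entry per algorithm rather than one globally sorted list, the rows of cv_results_ can be grouped by the class of the estimator stored under the 'classifier' key of each params dict. The sketch below is a hedged illustration: `best_per_classifier` is a hypothetical helper, and the stub classes and scores are made up so the snippet runs without scikit-learn. With a real search, best_model.cv_results_ would be passed instead.

```python
class StubLogReg:
    """Stand-in for an estimator class (hypothetical, illustration only)."""


class StubForest:
    """Stand-in for a second estimator class (hypothetical)."""


def best_per_classifier(cv_results):
    """Return (classifier name, best entry) pairs, sorted by score, best first."""
    best = {}
    for params, score in zip(cv_results["params"], cv_results["mean_test_score"]):
        name = type(params["classifier"]).__name__
        # Keep only the highest-scoring parameter set for each classifier type.
        if name not in best or score > best[name]["score"]:
            best[name] = {"score": score, "params": params}
    return sorted(best.items(), key=lambda item: item[1]["score"], reverse=True)


# Made-up data in the shape of GridSearchCV.cv_results_ (values are illustrative).
fake_cv_results = {
    "params": [
        {"classifier": StubLogReg(), "classifier__C": 1.0},
        {"classifier": StubLogReg(), "classifier__C": 10.0},
        {"classifier": StubForest(), "classifier__n_estimators": 10},
        {"classifier": StubForest(), "classifier__n_estimators": 100},
    ],
    "mean_test_score": [0.91, 0.95, 0.93, 0.96],
}

table = best_per_classifier(fake_cv_results)
for name, entry in table:
    print(name, entry["score"])
```

When scoring='f1_weighted' is used, mean_test_score holds the weighted F-score, so the resulting order is the F-score order asked for in the question.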
I am trying to use GridSearchCV along with KerasRegressor for a hyperparameter search. The Keras model.fit function on its own lets you inspect the 'loss' and 'val_loss' values through the history object.
Is it possible to look at the 'loss' and 'val_loss' values when using GridSearchCV?
Here is the code I am using to do a grid search:
model = KerasRegressor(build_fn=create_model_gridsearch, verbose=0)
layers = [[16], [16,8]]
activations = ['relu' ]
optimizers = ['Adam']
param_grid = dict(layers=layers, activation=activations, input_dim=[X_train.shape[1]], output_dim=[Y_train.shape[1]], batch_size=specified_batch_size, epochs=num_of_epochs, optimizer=optimizers)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1, cv=7)
grid_result = grid.fit(X_train, Y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in sorted(zip(means, stds, params), key=lambda x: x[0]):
    print("%f (%f) with: %r" % (mean, stdev, param))
def create_model_gridsearch(input_dim, output_dim, layers, activation, optimizer):
    model = Sequential()
    for i, nodes in enumerate(layers):
        if i == 0:
            model.add(Dense(nodes, input_dim=input_dim))
            model.add(Activation(activation))
        else:
            model.add(Dense(nodes))
            model.add(Activation(activation))
    model.add(Dense(output_dim, activation='linear'))
    model.compile(optimizer=optimizer, loss='mean_squared_error')
    return model
How can I get the training and CV loss per epoch for the best model, grid_result.best_estimator_.model?
There is no attribute like grid_result.best_estimator_.model.history.keys().
The history is well hidden. I was able to find it in
grid_result.best_estimator_.model.model.history.history
A slight correction to the answer above:
"grid_result.best_estimator_.model.history.history" gives the history.
I am trying to use sklearn's grid search with a model created by xgboost. To do this, I am creating a custom scorer based on NDCG evaluation. Snippet 1 works, but it is too messy and hacky; I would prefer to use good old sklearn to simplify the code. I tried to implement GridSearchCV and the result is completely off: for the same X and y sets I get NDCG@k = 0.8 with Snippet 1 versus 0.5 with Snippet 2. Obviously there is something I am not doing right here ...
The following pieces of code return very different results:
Snippet1:
kf = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=42)
max_depth = [6]
learning_rate = [0.22]
n_estimators = [43]
reg_alpha = [0.1]
reg_lambda = [10]
for md in max_depth:
    for lr in learning_rate:
        for ne in n_estimators:
            for ra in reg_alpha:
                for rl in reg_lambda:
                    xgb = XGBClassifier(objective='multi:softprob',
                                        max_depth=md,
                                        learning_rate=lr,
                                        n_estimators=ne,
                                        reg_alpha=ra,
                                        reg_lambda=rl,
                                        subsample=0.6, colsample_bytree=0.6, seed=0)
                    print([md, lr, ne])
                    score = []
                    for train_index, test_index in kf:
                        X_train, X_test = X[train_index], X[test_index]
                        y_train, y_test = y[train_index], y[test_index]
                        xgb.fit(X_train, y_train)
                        y_pred = xgb.predict_proba(X_test)
                        score.append(ndcg_scorer(y_test, y_pred))
                    print('all scores: %s' % score)
                    print('average score: %s' % np.mean(score))
Snippet2:
from sklearn.grid_search import GridSearchCV
params = {
    'max_depth':[6],
    'learning_rate':[0.22],
    'n_estimators':[43],
    'reg_alpha':[0.1],
    'reg_lambda':[10],
    'subsample':[0.6],
    'colsample_bytree':[0.6]
}
xgb = XGBClassifier(objective='multi:softprob',seed=0)
scorer = make_scorer(ndcg_scorer, needs_proba=True)
gs = GridSearchCV(xgb, params, cv=5, scoring=scorer, verbose=10, refit=False)
gs.fit(X,y)
gs.best_score_
While Snippet 1 gives me the expected result, the score returned by Snippet 2 is not consistent with the ndcg_scorer.
The problem is with cv in GridSearchCV(xgb, params, cv=5, scoring=scorer, verbose=10, refit=False). It can receive a KFold / StratifiedKFold object instead of an int. Contrary to what the documentation says, it seems that an int argument does not end up calling StratifiedKFold by default, but some other splitter, maybe KFold.
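Whichever splitter an int ends up selecting, the choice of splitter changes which samples land in each fold, and that alone can move a CV score. The sketch below illustrates this with the standard library only. Both functions are simplified stand-ins (an assumption), not scikit-learn's actual KFold or StratifiedKFold algorithms, and the labels are made up.

```python
from collections import Counter


def contiguous_folds(labels, n_splits):
    """Plain k-fold without shuffling: consecutive blocks of indices."""
    n = len(labels)
    sizes = [n // n_splits + (1 if i < n % n_splits else 0) for i in range(n_splits)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def stratified_folds(labels, n_splits):
    """Deal each class round-robin so every fold keeps roughly the class ratio."""
    folds = [[] for _ in range(n_splits)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds


# Made-up labels sorted by class, the worst case for unshuffled plain k-fold.
labels = [0] * 6 + [1] * 3

plain = contiguous_folds(labels, 3)
strat = stratified_folds(labels, 3)
for name, folds in (("plain", plain), ("stratified", strat)):
    print(name, [dict(Counter(labels[i] for i in fold)) for fold in folds])
```

Passing the same explicit splitter object to cv in both snippets removes this source of difference, which is what the answer recommends.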