ValueError while finding best hyperparameter in scikit-learn LogisticRegression using GridSearchCV

While doing hyperparameter tuning with GridSearchCV for LogisticRegression, I am getting the following error:
ValueError: Invalid parameter Hparam for estimator
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1, penalty='l1', random_state=None, solver='liblinear', tol=0.0001, verbose=1, warm_start=False)
I've written my code below:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# build a geometric grid of candidate values, doubling from 1e-4 up to 1e5
hparam = []
a = 0.0001
while a < 100000:
    hparam.append(a)
    a *= 2

LReg = LogisticRegression(penalty='l1', verbose=1, n_jobs=-1)
param_grid = {'Hparam': hparam}
grid_ = GridSearchCV(LReg, param_grid, scoring='roc_auc', cv=10)
grid_.fit(xtr_, ytr_)

Referring to the scikit-learn LogisticRegression documentation, Hparam is not listed as a hyperparameter of LogisticRegression, which is exactly why GridSearchCV rejects it: the keys of param_grid must match the estimator's parameter names. The regularization strength you are building a grid for is called C.
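A minimal fix (a sketch reusing LReg and hparam from the question; xtr_ and ytr_ are the question's training data):
param_grid = {'C': hparam}  # 'C' is the actual LogisticRegression parameter name
grid_ = GridSearchCV(LReg, param_grid, scoring='roc_auc', cv=10)
grid_.fit(xtr_, ytr_)
print(grid_.best_params_)
If in doubt, sorted(LReg.get_params().keys()) lists every valid parameter name.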

Related

Linear regression parameters not appearing after fitting [duplicate]

from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel
The output of the above code is just
LogisticRegression()
But I expected something more detailed, including the model parameters, i.e.:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
What am I doing wrong?
This is due to a change in the default configuration settings from scikit-learn v0.23 onwards; from the changelog:
The default setting print_changed_only has been changed from False to True. This means that the repr of estimators is now more concise and only shows the parameters whose default value has been changed when printing an estimator. You can restore the previous behaviour by using sklearn.set_config(print_changed_only=False). Also, note that it is always possible to quickly inspect the parameters of any estimator using est.get_params(deep=False).
In other words, in versions before v0.23, the following code:
import sklearn
sklearn.__version__
# 0.22.2
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr
produces the following output with all model parameters:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
But the same code from v0.23 onwards:
import sklearn
sklearn.__version__
# 0.23.2
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr
will produce just:
LogisticRegression()
in cases like this one, i.e. where no parameter has been explicitly defined and all remain at their default values. That is because the print_changed_only parameter is now set to True by default:
sklearn.get_config()
# result:
{'assume_finite': False,
'working_memory': 1024,
'print_changed_only': True,
'display': 'text'}
To get all the parameters printed in the newer scikit-learn versions, you should either do
lr.get_params()
# result
{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 100,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': None,
'solver': 'lbfgs',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
or change the setting (preferable, since it will affect any and all models used afterwards):
sklearn.set_config(print_changed_only=False) # needed only once
lr # as defined above
# result
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
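If you only need the verbose representation temporarily, you can also change the setting for a limited scope with the config_context context manager (a small sketch, using the lr defined above):
import sklearn

with sklearn.config_context(print_changed_only=False):
    print(lr)  # full repr with all parameters inside the context
print(lr)      # concise repr again outside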

AdaBoost in Pipeline with GridSearchCV (sklearn)

I would like to use AdaBoostClassifier with LinearSVC as the base estimator, and do a grid search over some of the LinearSVC parameters. I also have to scale my features.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import LinearSVC

p_grid = {'base_estimator__C': np.logspace(-5, 3, 10)}
n_splits = 5
inner_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=5)

SVC_Kernel = LinearSVC(multi_class='crammer_singer', tol=10e-3,
                       max_iter=10000, class_weight='balanced')
ABC = AdaBoostClassifier(base_estimator=SVC_Kernel, n_estimators=600,
                         learning_rate=1.5, algorithm="SAMME")

for train_index, test_index in kk.split(input):  # kk: outer CV splitter, defined elsewhere
    X_train, X_test = input[train_index], input[test_index]
    y_train, y_test = target[train_index], target[test_index]

    pipe_SVC = Pipeline([('scaler', RobustScaler()), ('AdaBoostClassifier', ABC)])
    clfSearch = GridSearchCV(estimator=pipe_SVC, param_grid=p_grid,
                             cv=inner_cv, scoring='f1_macro', iid=False, n_jobs=-1)
    clfSearch.fit(X_train, y_train)
The following error occurs:
ValueError: Invalid parameter base_estimator for estimator Pipeline(memory=None,
steps=[('scaler',
RobustScaler(copy=True, quantile_range=(25.0, 75.0),
with_centering=True, with_scaling=True)),
('AdaBoostClassifier',
AdaBoostClassifier(algorithm='SAMME',
base_estimator=LinearSVC(C=1.0,
class_weight='balanced',
dual=True,
fit_intercept=True,
intercept_scaling=1,
loss='squared_hinge',
max_iter=10000,
multi_class='crammer_singer',
penalty='l2',
random_state=None,
tol=0.01,
verbose=0),
learning_rate=1.5, n_estimators=600,
random_state=None))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Without the AdaBoostClassifier the pipeline works, so I think that is where the problem lies.
I think your p_grid should be defined as follows:
p_grid = {'AdaBoostClassifier__base_estimator__C': np.logspace(-5, 3, 10)}
Try pipe_SVC.get_params().keys() if you are not sure about the exact parameter names.
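The underlying rule is scikit-learn's double-underscore convention: inside a Pipeline, every parameter name is prefixed by its step name, and nested estimators add one prefix per level. A quick way to see this with the objects from the question:
# prints every tunable name, including
# 'AdaBoostClassifier__base_estimator__C'
print(sorted(pipe_SVC.get_params().keys()))
Here 'AdaBoostClassifier' is the step name chosen in the Pipeline, and 'base_estimator' reaches through the AdaBoostClassifier to the LinearSVC.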

Keras NN model isn't compatible with sklearn's VotingClassifier

I'm trying to ensemble my models and use a VotingClassifier from sklearn to get an accuracy score. Right now, my Keras model (NN) fails during the ensemble fit. I've tried using sklearn's NN and KerasClassifier; basically, I've run out of options. Here's my code:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import BatchNormalization, Flatten
from tensorflow.keras.optimizers import SGD

def multiLayerPerceptionModel(nb_epochs, hidden_1, hidden_2, learn_rate,
                              batch_size, num_input, num_classes, path_log,
                              X_Train, X_Test, Y_Train, Y_Test):
    model_RN = keras.Sequential([
        keras.layers.Dense(hidden_1, activation=tf.nn.relu),
        keras.layers.Dense(hidden_2, activation=tf.nn.relu),
        keras.layers.Dense(num_classes, activation='softmax'),
        BatchNormalization(),
    ])
    model_RN.add(layers.Dense(num_classes, activation='softmax'))
    model_RN.compile(optimizer=SGD(learn_rate),
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])
    model_RN.fit(X_Train, Y_Train, validation_data=(X_Test, Y_Test),
                 epochs=nb_epochs, batch_size=batch_size,
                 callbacks=[tensorboard])  # tensorboard: a callback defined elsewhere
    model_RN.add(Flatten())
Fitting the ensemble raises:
TypeError: Cannot clone object '<tensorflow.python.keras.engine.sequential.Sequential object at 0x11778d400>' (type <class 'tensorflow.python.keras.engine.sequential.Sequential'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
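The error message itself points at the cause: scikit-learn's ensembles clone their estimators through get_params, and a raw Keras Sequential model does not implement that interface. The usual workaround is to wrap the model-building function in KerasClassifier so the model exposes the scikit-learn estimator API. A minimal sketch, assuming the legacy keras.wrappers.scikit_learn module and hard voting (NUM_CLASSES and the layer sizes are placeholders):
import keras
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

NUM_CLASSES = 3  # placeholder: match your own data

def build_model():
    # the builder must return a compiled Keras model
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dense(NUM_CLASSES, activation='softmax'),
    ])
    model.compile(optimizer='sgd',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# the wrapper implements get_params/fit/predict, so sklearn can clone it
keras_clf = KerasClassifier(build_fn=build_model, epochs=10,
                            batch_size=32, verbose=0)
ensemble = VotingClassifier(
    estimators=[('nn', keras_clf), ('logreg', LogisticRegression())],
    voting='hard')
# ensemble.fit(X_train, y_train) then works on plain numpy arrays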

Stacking StandardScaler() with RFECV and GridSearchCV

So I found out that StandardScaler() can make my RFECV, nested inside my GridSearchCV with 3-fold cross-validation at each level, run faster. Without StandardScaler(), my code ran for more than 2 days, so I canceled it and decided to inject StandardScaler() into the process. But now it has been running for more than 4 hours and I am not sure if I have done it right. Here is my code:
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# choose linear SVM as the classifier
LSVM = SVC(kernel='linear')
selector = RFECV(LSVM, step=1, cv=3, scoring='f1')
param_grid = [{'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = make_pipeline(StandardScaler(),
                    GridSearchCV(selector,
                                 param_grid,
                                 cv=3,
                                 refit=True,
                                 scoring='f1'))
clf.fit(X, Y)
I think I haven't gotten it right, to be honest, because I think the StandardScaler() should be put inside the GridSearchCV() so that the data is normalized in each fold, not just once (?). Please correct me if I am wrong or if my pipeline is incorrect, and hence why it is still running for such a long time.
I have 8,000 rows of 145 features to be pruned by RFECV, and 6 C values to be searched over by GridSearchCV. So for each C value, the best feature set is determined by the RFECV.
Thanks!
Update:
So I put the StandardScaler inside the RFECV like this:
# in addition to the imports above
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

clf = SVC(kernel='linear')
kf = KFold(n_splits=3, shuffle=True, random_state=0)
estimators = [('standardize', StandardScaler()),
              ('clf', clf)]

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

pipeline = Mypipeline(estimators)
rfecv = RFECV(estimator=pipeline, cv=kf, scoring='f1', verbose=10)
param_grid = [{'estimator__svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = GridSearchCV(rfecv, param_grid, cv=3, scoring='f1', verbose=10)
But it still throws the following error:
ValueError: Invalid parameter C for estimator Pipeline(memory=None,
    steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
           ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                       decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
                       max_iter=-1, probability=False, random_state=None, shrinking=True,
                       tol=0.001, verbose=False))]).
Check the list of available parameters with `estimator.get_params().keys()`.
Kumar is right. Also, you might want to turn on verbose in the GridSearchCV. You could also add a limit on the number of iterations of the SVC, starting from a very small number like 5, just to make sure that the problem is not with convergence.
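One more thing worth double-checking in the updated code (a hedged observation): the pipeline step holding the SVC is named 'clf', not 'svc', so the nested parameter path from GridSearchCV through RFECV would be estimator -> clf -> C. A sketch of the corrected grid, reusing the objects from the update:
# step name is 'clf', so the key is 'estimator__clf__C'
param_grid = [{'estimator__clf__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = GridSearchCV(rfecv, param_grid, cv=3, scoring='f1', verbose=10)
# sorted(rfecv.get_params().keys()) lists the valid names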

From SKLearn to Keras - What is the difference?

I'm trying to go from SKLearn to Keras in order to make specific improvements to my models.
However, I can't get the same performance I had with my SKLearn model:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    solver='adam', activation='relu',
    beta_1=0.9, beta_2=0.999, learning_rate='constant',
    alpha=0, hidden_layer_sizes=(238,),
    max_iter=300
)
dev_score(mlp)  # dev_score: the question's own evaluation helper
This gives a score of ~0.65 every time.
Here is my corresponding Keras code:
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.regularizers import l2
from keras.wrappers.scikit_learn import KerasClassifier

def build_model(alpha):
    level_moreargs = {'kernel_regularizer': l2(alpha),
                      'kernel_initializer': 'glorot_uniform'}
    model = Sequential()
    model.add(Dense(units=238, input_dim=X.shape[1], **level_moreargs))
    model.add(Activation('relu'))
    model.add(Dense(units=class_names.shape[0], **level_moreargs))  # output
    model.add(Activation('softmax'))
    model.compile(loss=keras.losses.categorical_crossentropy,  # like sklearn
                  optimizer=keras.optimizers.Adam(lr=0.001, beta_1=0.9,
                                                  beta_2=0.999, epsilon=1e-08,
                                                  decay=0.0),
                  metrics=['accuracy'])
    return model

k_dnn = KerasClassifier(build_fn=build_model, epochs=300, batch_size=200,
                        validation_data=None, shuffle=True, alpha=0.5,
                        verbose=0)
dev_score(k_dnn)
From looking at the documentation (and digging into the SKLearn code), this should correspond to exactly the same thing.
However, I get ~0.5 accuracy when I run this model, which is very bad.
And if I set alpha to 0, SKLearn's score barely changes (0.63), while Keras's varies randomly between 0.2 and 0.4.
What is the difference between these models? Why is Keras, which is supposed to be better than SKLearn, outperformed by so much here? What's my mistake?
Thanks,
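One concrete difference worth checking (a hedged observation, not a confirmed answer): the two alpha values are not on the same scale. MLPClassifier adds 0.5 * alpha * ||W||^2 / n_samples to its loss, while Keras's l2(alpha) kernel regularizer adds alpha * ||W||^2 to every batch loss with no such scaling, so the same numeric alpha penalizes the Keras model far more heavily. A rough conversion sketch (n_samples is the training-set size; this is an approximation, and exact equivalence also depends on batch accounting):
from keras.regularizers import l2

def matched_l2(alpha, n_samples):
    # MLPClassifier's penalty: 0.5 * alpha * ||W||^2 / n_samples
    # keras l2(a) penalty:           a * ||W||^2
    return l2(alpha / (2.0 * n_samples))
This would not explain the run-to-run variance at alpha=0, though, which looks more like initialization and shuffling noise.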
