I don't understand the get_params([deep]) method available for TruncatedSVD in sklearn. Can someone please explain it to me?
Check out the source of get_params here: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/base.py#L213
Not just TruncatedSVD: essentially all scikit-learn estimators have this method, because they inherit it from the BaseEstimator class.
And as the name says, it returns the values of the parameters set on the estimator. In your case, check out the list of parameters here: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
n_components : int, default = 2
algorithm : string, default = "randomized"
n_iter : int, optional (default 5)
random_state : int, RandomState instance or None, optional, default = None
tol : float, optional
Let's say you initialize the TruncatedSVD with the following code:
svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
Then the output of svd.get_params() will be:
{'algorithm': 'randomized',
'n_components': 5,
'n_iter': 7,
'random_state': 42,
'tol': 0.0}
This is useful for making a clone of the object, and it is used extensively in various scikit-learn utilities like cross_val_score, GridSearchCV, Pipeline, etc.
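As a quick illustration (a minimal sketch, not from the original answer), sklearn.base.clone uses get_params under the hood to build a fresh, unfitted copy of an estimator with the same settings:
from sklearn.base import clone
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)

# clone() reads the constructor parameters via get_params() and
# builds a new, unfitted estimator configured identically.
svd_copy = clone(svd)
print(svd_copy.get_params() == svd.get_params())  # True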
If deep=True (the default), it will additionally return the parameters of any inner estimators (for example, the steps of a Pipeline), not just the outer estimator's own parameters.
For example take this code:
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
anova_filter = SelectKBest(f_regression, k=5)
clf = svm.SVC(kernel='linear')
anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
The output of anova_svm.get_params(deep=False) is below:
{'memory': None,
'steps': [('anova',
SelectKBest(k=5, score_func=<function f_regression at 0x7fb34d50ede8>)),
('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))]}
And below is the output of anova_svm.get_params(deep=True):
{'anova': SelectKBest(k=5, score_func=<function f_regression at 0x7fb34d50ede8>),
'anova__k': 5,
'anova__score_func': <function sklearn.feature_selection.univariate_selection.f_regression>,
'memory': None,
'steps': [('anova',
SelectKBest(k=5, score_func=<function f_regression at 0x7fb34d50ede8>)),
('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))],
'svc': SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
'svc__C': 1.0,
'svc__cache_size': 200,
'svc__class_weight': None,
'svc__coef0': 0.0,
'svc__decision_function_shape': 'ovr',
'svc__degree': 3,
'svc__gamma': 'auto',
'svc__kernel': 'linear',
'svc__max_iter': -1,
'svc__probability': False,
'svc__random_state': None,
'svc__shrinking': True,
'svc__tol': 0.001,
'svc__verbose': False}
You can see that the output now also contains the parameter values of the SVC and SelectKBest estimators, which are the internal steps of the Pipeline estimator.
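These double-underscore names are exactly what you pass to set_params (and hence to GridSearchCV parameter grids). A small sketch, reusing the anova_svm pipeline defined above:
# The nested names returned by get_params(deep=True) can be fed
# back into set_params() to reconfigure the inner estimators.
anova_svm.set_params(anova__k=10, svc__C=0.1)
print(anova_svm.get_params()['svc__C'])  # 0.1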
Related
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel
The output of the above code is just
LogisticRegression()
But I expected something more detailed, including the model parameters, i.e.:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
What am I doing wrong?
This is due to a change in the default configuration settings from scikit-learn v0.23 onwards; from the changelog:
The default setting print_changed_only has been changed from False to True. This means that the repr of estimators is now more concise and only shows the parameters whose default value has been changed when printing an estimator. You can restore the previous behaviour by using sklearn.set_config(print_changed_only=False). Also, note that it is always possible to quickly inspect the parameters of any estimator using est.get_params(deep=False).
In other words, in versions before v0.23, the following code:
import sklearn
sklearn.__version__
# 0.22.2
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr
produces the following output with all model parameters:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
But the same code from v0.23 onwards:
import sklearn
sklearn.__version__
# 0.23.2
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr
will produce just:
LogisticRegression()
in cases like this one, i.e. where no parameter has been explicitly set and all remain at their default values. And that's because the print_changed_only parameter is now set to True by default:
sklearn.get_config()
# result:
{'assume_finite': False,
'working_memory': 1024,
'print_changed_only': True,
'display': 'text'}
To get all the parameters printed in the newer scikit-learn versions, you should either do
lr.get_params()
# result
{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 100,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': None,
'solver': 'lbfgs',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
or change the setting (preferable, since it will affect any and all models used afterwards):
sklearn.set_config(print_changed_only=False) # needed only once
lr # as defined above
# result
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
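Side note (not part of the original answer): if you want the verbose repr only temporarily, scikit-learn also offers a context manager, sklearn.config_context, which restores the global setting afterwards:
import sklearn
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

# Inside the context the repr shows all parameters; on exit the
# global print_changed_only setting is restored automatically.
with sklearn.config_context(print_changed_only=False):
    print(lr)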
I'm trying to use the Pipeline class from imblearn and GridSearchCV to get the best parameters for classifying an imbalanced dataset. As per the answers mentioned here, I want to leave out resampling of the validation set and only resample the training set, which imblearn's Pipeline seems to be doing. However, I'm getting an error while implementing the accepted solution. Please let me know what I am doing wrong. Below is my implementation:
def imb_pipeline(clf, X, y, params):
    model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', clf)
    ])

    score = {'AUC': 'roc_auc',
             'RECALL': 'recall',
             'PRECISION': 'precision',
             'F1': 'f1'}

    gcv = GridSearchCV(estimator=model, param_grid=params, cv=5, scoring=score,
                       n_jobs=12, refit='F1', return_train_score=True)
    gcv.fit(X, y)
    return gcv
for param, classifier in zip(params, classifiers):
    print("Working on {}...".format(classifier[0]))
    clf = imb_pipeline(classifier[1], X_scaled, y, param)
    print("Best parameter for {} is {}".format(classifier[0], clf.best_params_))
    print("Best `F1` for {} is {}".format(classifier[0], clf.best_score_))
    print('-'*50)
    print('\n')
params:
[{'penalty': ('l1', 'l2'), 'C': (0.01, 0.1, 1.0, 10)},
{'n_neighbors': (10, 15, 25)},
{'n_estimators': (80, 100, 150, 200), 'min_samples_split': (5, 7, 10, 20)}]
classifiers:
[('Logistic Regression',
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,
warm_start=False)),
('KNearestNeighbors',
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')),
('Gradient Boosting Classifier',
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='auto',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False))]
Error:
ValueError: Invalid parameter C for estimator Pipeline(memory=None,
steps=[('sampling',
SMOTE(k_neighbors=5, kind='deprecated',
m_neighbors='deprecated', n_jobs=1,
out_step='deprecated', random_state=None, ratio=None,
sampling_strategy='auto', svm_estimator='deprecated')),
('classification',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None,
penalty='l2', random_state=None,
solver='warn', tol=0.0001, verbose=0,
warm_start=False))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Please check this example of how to use parameters with a Pipeline:
- https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py
Whenever you use a Pipeline, you need to supply the parameters in a way that lets the pipeline understand which parameter belongs to which step in the list. For that, it uses the names you provided during Pipeline initialisation.
In your code, for example:
model = Pipeline([
('sampling', SMOTE()),
('classification', clf)
])
To pass the parameter p1 to SMOTE you would use sampling__p1 as a parameter, not p1.
You used "classification" as a name for your clf so append that to the parameters which are supposed to go to the clf.
Try:
[{'classification__penalty': ('l1', 'l2'), 'classification__C': (0.01, 0.1, 1.0, 10)},
 {'classification__n_neighbors': (10, 15, 25)},
 {'classification__n_estimators': (80, 100, 150, 200), 'classification__min_samples_split': (5, 7, 10, 20)}]
Make sure there are two underscores between the name and the parameter.
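If you are unsure what the valid prefixed names are, the error message's own suggestion works well. A quick sketch, assuming the model pipeline from the question:
# Lists every parameter name the pipeline accepts, including the
# step-prefixed ones such as 'classification__C'.
print(sorted(model.get_params().keys()))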
I am trying to run GradientBoostingClassifier() with the help of GridSearchCV.
For every combination of parameters, I also need "Precision", "Recall" and accuracy in tabular format.
Here is the code:
scoring = ['accuracy', 'precision', 'recall']

parameters = {  # 'nthread': [3, 4],  # when using hyperthreading, xgboost may become slower
    "criterion": ["friedman_mse", "mae"],
    "loss": ["deviance", "exponential"],
    "max_features": ["log2", "sqrt"],
    'learning_rate': [0.01, 0.05, 0.1, 1, 0.5],  # the so-called `eta` value
    'max_depth': [3, 4, 5],
    'min_samples_leaf': [4, 5, 6],
    'subsample': [0.6, 0.7, 0.8],
    'n_estimators': [5, 10, 15, 20],  # number of trees; change to 1000 for better results
    'scoring': scoring
}
# sorted(sklearn.metrics.SCORERS.keys()) # To see different loss functions
#clf_xgb = GridSearchCV(xgb_model, parameters, n_jobs=5,verbose=2, refit=True,cv = 8)
clf_gbm = GridSearchCV(gbm_model, parameters, n_jobs=5,cv = 8)
clf_gbm.fit(X_train,y_train)
print(clf_gbm.best_params_)
print(clf_gbm.best_score_)
feature_importances = pd.DataFrame(clf_gbm.best_estimator_.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)
depth=clf_gbm.cv_results_["param_max_depth"]
score=clf_gbm.cv_results_["mean_test_score"]
params=clf_gbm.cv_results_["params"]
I get the following error:
ValueError: Invalid parameter seed for estimator GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.01, loss='deviance', max_depth=3,
max_features='log2', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=4, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=5, presort='auto',
random_state=None, subsample=1.0, verbose=0,
warm_start=False). Check the list of available parameters with `estimator.get_params().keys()`.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer
#creating Scoring parameter:
scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score),
           'recall': make_scorer(recall_score)}
# A sample parameter
parameters = {
"loss":["deviance"],
"learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
"min_samples_split": np.linspace(0.1, 0.5, 12),
"min_samples_leaf": np.linspace(0.1, 0.5, 12),
"max_depth":[3,5,8],
"max_features":["log2","sqrt"],
"criterion": ["friedman_mse", "mae"],
"subsample":[0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
"n_estimators":[10]
}
#passing the scoring function in the GridSearchCV
clf = GridSearchCV(GradientBoostingClassifier(), parameters,scoring=scoring,refit=False,cv=2, n_jobs=-1)
clf.fit(trainX, trainY)
#converting the clf.cv_results to dataframe
df=pd.DataFrame.from_dict(clf.cv_results_)
# with cv=2 there are two splits, so cv_results_ contains split0 and split1 columns
df[['split0_test_accuracy','split1_test_accuracy','split0_test_precision','split1_test_precision','split0_test_recall','split1_test_recall']]
Find the best parameters based on the accuracy_score, precision_score or recall_score, then refit the model and predict on the test data:
#find the best parameter based on the accuracy_score
#taking the average of the accuracy_score
df['accuracy_score']=(df['split0_test_accuracy']+df['split1_test_accuracy'])/2
df.loc[df['accuracy_score'].idxmax()]['params']
Prediction on the test data
clf =GradientBoostingClassifier(criterion='mae',
learning_rate=0.1,
loss='deviance',
max_depth= 5,
max_features='sqrt',
min_samples_leaf= 0.1,
min_samples_split= 0.42727272727272736,
n_estimators=10,
subsample=0.8)
clf.fit(trainX, trainY)
correct_test = correct_data(test)
testX = correct_test[predictor].values
result = clf.predict(testX)
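As an alternative to refitting manually (a sketch, assuming the scoring dict and parameters defined above): with multiple scorers, GridSearchCV can refit automatically if you name one metric in refit, which makes best_estimator_, best_params_ and predict() available directly.
# With refit='accuracy', GridSearchCV retrains the best model (by
# accuracy) on the whole training set after the search finishes.
clf = GridSearchCV(GradientBoostingClassifier(), parameters,
                   scoring=scoring, refit='accuracy', cv=2, n_jobs=-1)
clf.fit(trainX, trainY)
result = clf.predict(testX)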
The question is simple: I have a 3D image and I want to segment it using SVM. So I converted the input and output images to 3D numpy arrays, and now I want to use SVM. But it seems that clf.fit() does not support multidimensional labels. How can I train my model when the label is a multidimensional array?
A simple example:
from sklearn import svm
x=[[0,0],[1,1]]
y=[[0,0],[1,1]]
clf=svm.SVC(gamma='scale')
clf.fit(x,y)
Error is:
Traceback (most recent call last):
File "basic.py", line 5, in <module>
clf.fit(x,y)
File "/usr/local/lib/python3.5/dist-packages/sklearn/svm/base.py", line 149, in fit
accept_large_sparse=False)
File "/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py", line 761, in check_X_y
y = column_or_1d(y, warn=True)
File "/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py", line 797, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (2, 2)
You're passing 2-D y class labels (one list per sample), and that's why it's not working: fit expects a single class label per sample. See the solution with inline comments below.
from sklearn import svm

x = [[0, 0], [1, 1], [7, 8]]
y = [0, 1, 2]  # class labels: exactly one label per sample

clf = svm.SVC()  # gamma='scale' isn't needed here; gamma defaults to 'auto' in this version
print(clf.fit(x, y))

q = clf.predict([[2., 2.]])  # simple example to test prediction
print('array : %s ' % q)
# use of multiple class labels for y
x = [[0, 0], [1, 1]]
y = [[0, 1], [0, 2]]  # the value 2 is there to show the difference in the printed output

# add your own `for item in x:` here if both arrays are 3D
# (the body below would then need extra indentation)
for item in y:  # iterates through the labeling lists
    print(item)
    clf = svm.SVC()
    print(clf.fit(x, item))
    q = clf.predict([[2., 2.]])
    print('array : %s ' % q)
Printed result:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
array : [1]
[0, 1]
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
array : [1]
[0, 2]
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
array : [2]
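For completeness, and not part of the answer above: if the labels really are multi-dimensional (one output per column) rather than one class per sample, scikit-learn's MultiOutputClassifier wraps a separate classifier around each output column. A minimal sketch:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

x = [[0, 0], [1, 1], [2, 2]]
y = [[0, 1], [1, 2], [1, 2]]  # one label per output column

# MultiOutputClassifier fits an independent SVC per column of y.
clf = MultiOutputClassifier(SVC(gamma='scale'))
clf.fit(x, y)
print(clf.predict([[2., 2.]]))  # e.g. [[1 2]]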
So I found out that StandardScaler() can make my RFECV inside my GridSearchCV, each with its own nested 3-fold cross validation, run faster. Without StandardScaler(), my code ran for more than 2 days, so I canceled it and decided to inject StandardScaler into the process. But now it has been running for more than 4 hours and I am not sure if I have done it right. Here is my code:
# Choose Linear SVM as classifier
LSVM = SVC(kernel='linear')
selector = RFECV(LSVM, step=1, cv=3, scoring='f1')
param_grid = [{'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = make_pipeline(StandardScaler(),
GridSearchCV(selector,
param_grid,
cv=3,
refit=True,
scoring='f1'))
clf.fit(X, Y)
To be honest, I don't think I've gotten it right, because I believe the StandardScaler() should be put inside the GridSearchCV() so that it normalizes the data on each fold, not just once (?). Please correct me if I am wrong, or if my pipeline is incorrect and that is why it is still running for such a long time.
I have 8,000 rows of 145 features to be pruned by RFECV, and 6 C-values to be searched by GridSearchCV. So for each C-value, the best feature set is determined by the RFECV.
Thanks!
Update:
So I put the StandardScaler inside the RFECV like this:
clf = SVC(kernel='linear')
kf = KFold(n_splits=3, shuffle=True, random_state=0)

estimators = [('standardize', StandardScaler()),
              ('clf', clf)]

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, cv=kf, scoring='f1', verbose=10)
param_grid = [{'estimator__svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = GridSearchCV(rfecv, param_grid, cv=3, scoring='f1', verbose=10)
But it still throws out the following error:
ValueError: Invalid parameter C for estimator Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
            ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                        decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
                        max_iter=-1, probability=False, random_state=None, shrinking=True,
                        tol=0.001, verbose=False))]). Check the list of available parameters with `estimator.get_params().keys()`.
Kumar is right. Also, you might want to turn on verbose in the GridSearchCV. You could also put a limit on the number of iterations of the SVC, starting from a very small number like 5, just to make sure the problem is not with convergence.
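On the error itself, a hedged guess based on the update above: the pipeline step is named 'clf', not 'svc', so the grid key has to match that step name.
# RFECV wraps the pipeline, so the parameter path is:
# estimator (the pipeline inside RFECV) -> clf (the step name) -> C
param_grid = [{'estimator__clf__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
clf = GridSearchCV(rfecv, param_grid, cv=3, scoring='f1', verbose=10)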