Why does SVC repeat its parameters in the output - python-3.x

My aim is to classify features extracted from a CNN using a support vector machine.
The extracted features have shape (2186, 128); they are a NumPy array saved in X_tr.
y has shape (2186,) and looks like array([0, 0, 0, ..., 0, 0, 0]).
Applying these to SVC.
Input:
from sklearn.svm import SVC
clf = SVC()
clf.fit(X_train, y)
Output:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Why is it giving parameters as output instead of classification?

What are you expecting to see? You have trained the classifier on your training data and now you need to evaluate the classifier on your test data. In scikit-learn, you train a classifier using:
clf.fit(X_train, y_train)
and you make a prediction with the trained classifier using something like:
predictions = clf.predict(X_test)
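For example, a minimal end-to-end sketch (assuming X is the (2186, 128) feature array and y the label array from the question):
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Hold out part of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC()
clf.fit(X_train, y_train)           # fit() returns the estimator itself
predictions = clf.predict(X_test)   # this is the actual classification
print(accuracy_score(y_test, predictions))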

from sklearn.svm import SVC
clf = SVC()
clf.fit(X_train, y)
won't print any output when run as a script. The SVC(...) block you saw is just the interactive interpreter echoing the return value of fit(), which is the fitted estimator itself with all of its parameters.
If you want to test classification and prediction, then use:
from sklearn.svm import SVC
clf = SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
clf.fit(X_train, y)
pred = clf.predict(X_test)
print(pred)
C, cache_size, class_weight, etc. are the parameters that SVC takes. You can use them for tuning, for example choosing a 'linear' or 'rbf' kernel with C=1000.
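For instance (an illustrative configuration, not a recommendation):
from sklearn.svm import SVC

# Linear kernel with a large regularization parameter C (illustrative values)
clf = SVC(kernel='linear', C=1000)
clf.fit(X_train, y)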
For more info please check: http://scikit-learn.org/stable/modules/svm.html

Related

Sklearn gridsearchcv score not matching

The variable names for my training and test data are X_train, X_test, Y_train, and Y_test.
I have run a GridSearchCV instance from sklearn to do hyperparameter tuning for my random forest model.
param_grid = {
    'n_estimators': [500],
    'max_features': ['sqrt', None],
    'max_depth': [6],
    'max_leaf_nodes': [8],
    'min_impurity_decrease': [0, 0.02],
    'min_samples_split': [2]
}
grid_search = GridSearchCV(RandomForestClassifier(
        criterion='gini',
        min_weight_fraction_leaf=0.0,
        bootstrap=True,
        n_jobs=-1,
        random_state=1, verbose=0,
        warm_start=False, class_weight='balanced',
        ccp_alpha=0.0,
        max_samples=None),
    param_grid=param_grid, verbose=50, cv=2, n_jobs=-1, scoring='balanced_accuracy')
grid_search.fit(X_train, Y_train)
All the scores that I can see while the grid search is training are in the range 0.4-0.6.
Following is the output of the best score:
[CV 2/2; 1/4] END max_depth=6, max_features=sqrt, max_leaf_nodes=8, min_impurity_decrease=0, min_samples_split=2, n_estimators=500;, score=0.552 total time= 15.4s
My question is: when I manually calculate balanced accuracy using
from sklearn.metrics import balanced_accuracy_score, by running print('training accuracy', balanced_accuracy_score(grid_search.predict(X_train), Y_train, adjusted=False)), I get a value of about 0.96, which is very different from what GridSearchCV shows during the run. Why is this so? And what does the score in GridSearchCV mean then? Please note I have passed scoring='balanced_accuracy' to GridSearchCV to make sure they calculate the same thing.
The score you get from GridSearchCV is the validation score (measured on the part of X_train not used to train the model).
Your manually calculated score is the training score (you fit the model and evaluate the score on the same data: X_train).
The large difference between the two is a sign of overfitting.
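You can see both numbers from the same search by asking GridSearchCV to keep the training scores as well (a minimal sketch reusing the question's variable names):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=1, class_weight='balanced'),
    param_grid=param_grid, cv=2, n_jobs=-1,
    scoring='balanced_accuracy',
    return_train_score=True)  # also record the score on the training folds
grid_search.fit(X_train, Y_train)

results = pd.DataFrame(grid_search.cv_results_)
# A large gap between these two columns is the overfitting signal
print(results[['mean_train_score', 'mean_test_score']])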
You can try to change param_grid:
param_grid = {
    'n_estimators': [100, 200],  # 500 seems high and might take too long for no reason
    'max_features': ['sqrt', 'log2', None],  # fewer features can reduce overfitting
    'max_depth': [3, 4, 5, 6],  # lower depth can reduce overfitting
    'max_leaf_nodes': [4, 6, 8],  # lower max_leaf_nodes can reduce overfitting
    'min_impurity_decrease': [0, 0.02],
    'min_samples_split': [2],
    'min_samples_leaf': [5, 10, 20]  # higher values can reduce overfitting
}
Also using cv=3 or cv=5 in GridSearchCV could help.
See this post about solving Random Forest overfitting.

Random forest classifier vastly underperforming comparing to logistic regression, SVM, and naive bayes with cross-validation grid-search?

I have a multiclass classification problem. In the cross-validation grid search to find the best hyperparameter settings, I found that the random forest is extremely under-performing (accuracy = 0.412, while the other ML algorithms reached 0.70 or higher). I understand this is not necessarily a red flag, as different ML algorithms may perform best or worst in different regions of the problem space. But I am wondering if I am not setting the range of its possible hyperparameters correctly.
ml_algo_param_dict = {
    'LR_V1': {
        'clf': LogisticRegression(),
        'param': {
            'logisticregression__solver': ['liblinear'],
            'logisticregression__penalty': ['l1', 'l2'],
            'logisticregression__C': np.logspace(-4, 4, 20),
            'logisticregression__tol': np.logspace(-5, 5, 20),
            'logisticregression__class_weight': [None, 'balanced'],
            'logisticregression__multi_class': ['ovr', 'auto'],
            'logisticregression__max_iter': [4000, 20000],
        }},
    'LR_V2': {
        'clf': LogisticRegression(),
        'param': {
            'logisticregression__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
            'logisticregression__penalty': ['none', 'l2'],
            'logisticregression__C': np.logspace(-4, 4, 20),
            'logisticregression__tol': np.logspace(-5, 5, 20),
            'logisticregression__class_weight': [None, 'balanced'],
            'logisticregression__multi_class': ['ovr', 'multinomial', 'auto'],
            'logisticregression__max_iter': [4000, 20000],
        }},
    'SVC': {
        'clf': OneVsRestClassifier(LinearSVC()),
        'param': {
            'onevsrestclassifier__estimator__penalty': ['l2'],
            'onevsrestclassifier__estimator__loss': ['hinge', 'squared_hinge'],
            'onevsrestclassifier__estimator__C': np.logspace(-4, 4, 20),
            'onevsrestclassifier__estimator__tol': np.logspace(-5, 5, 20),
            'onevsrestclassifier__estimator__class_weight': [None, 'balanced'],
            'onevsrestclassifier__estimator__multi_class': ['ovr', 'crammer_singer'],
            'onevsrestclassifier__estimator__max_iter': [50, 1000, 4000, 20000],
        }},
    'RF': {
        'clf': RandomForestClassifier(),
        'param': {
            'randomforestclassifier__n_estimators': [1, 2, 4, 8, 16, 32, 64, 100, 200, 500],
            'randomforestclassifier__criterion': ['gini', 'entropy'],
            'randomforestclassifier__class_weight': [None, 'balanced', 'balanced_subsample'],
            'randomforestclassifier__max_depth': np.linspace(1, 10, 32, endpoint=True),
            'randomforestclassifier__min_samples_split': np.linspace(0.1, 1.0, 10, endpoint=True),
            'randomforestclassifier__min_samples_leaf': np.linspace(0.1, 0.5, 5, endpoint=True),
            'randomforestclassifier__max_leaf_nodes': [None, 50, 100, 200, 400],
            'randomforestclassifier__max_features': [None, 'auto', 'sqrt', 'log2'],
        }},
    'NB': {
        'clf': BernoulliNB(),
        'param': {
            'bernoullinb__alpha': np.logspace(-4, 4, 20),
            'bernoullinb__binarize': [None, 0, .2, .4, .6, .8, 1],
            'bernoullinb__fit_prior': [True, False],
        }},
}
Result
>> Best score: 0.712
>> Best parameter:
Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with...
verbose=False),
['LOC_ENTITY_LIST'])],
verbose=False)),
('logisticregression',
LogisticRegression(C=0.03359818286283781,
class_weight='balanced', dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=4000,
multi_class='ovr', n_jobs=None,
penalty='l2', random_state=None,
solver='liblinear', tol=1e-05, verbose=0,
warm_start=False))],
verbose=False)
>> Best selected parameter:
{'logisticregression__tol': 1e-05, 'logisticregression__solver': 'liblinear', 'logisticregression__penalty': 'l2', 'logisticregression__multi_class': 'ovr', 'logisticregression__max_iter': 4000, 'logisticregression__class_weight': 'balanced', 'logisticregression__C': 0.03359818286283781}
>> Best score: 0.738
>> Best parameter:
Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with...
verbose=False),
['LOC_ENTITY_LIST'])],
verbose=False)),
('logisticregression',
LogisticRegression(C=0.23357214690901212, class_weight=None,
dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None,
max_iter=20000, multi_class='ovr',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs',
tol=0.01438449888287663, verbose=0,
warm_start=False))],
verbose=False)
>> Best selected parameter:
{'logisticregression__tol': 0.01438449888287663, 'logisticregression__solver': 'lbfgs', 'logisticregression__penalty': 'l2', 'logisticregression__multi_class': 'ovr', 'logisticregression__max_iter': 20000, 'logisticregression__class_weight': None, 'logisticregression__C': 0.23357214690901212}
>> Best score: 0.708
>> Best parameter:
Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with...
verbose=False),
['LOC_ENTITY_LIST'])],
verbose=False)),
('onevsrestclassifier',
OneVsRestClassifier(estimator=LinearSVC(C=78.47599703514607,
class_weight='balanced',
dual=True,
fit_intercept=True,
intercept_scaling=1,
loss='hinge',
max_iter=4000,
multi_class='ovr',
penalty='l2',
random_state=None,
tol=3.359818286283781e-05,
verbose=0),
n_jobs=None))],
verbose=False)
>> Best selected parameter:
{'onevsrestclassifier__estimator__tol': 3.359818286283781e-05, 'onevsrestclassifier__estimator__penalty': 'l2', 'onevsrestclassifier__estimator__multi_class': 'ovr', 'onevsrestclassifier__estimator__max_iter': 4000, 'onevsrestclassifier__estimator__loss': 'hinge', 'onevsrestclassifier__estimator__class_weight': 'balanced', 'onevsrestclassifier__estimator__C': 78.47599703514607}
>> Best score: 0.412
>> Best parameter:
Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with...
RandomForestClassifier(bootstrap=True, class_weight=None,
criterion='gini',
max_depth=6.806451612903226,
max_features=None, max_leaf_nodes=50,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=0.2,
min_samples_split=0.2,
min_weight_fraction_leaf=0.0,
n_estimators=200, n_jobs=None,
oob_score=False, random_state=None,
verbose=0, warm_start=False))],
verbose=False)
>> Best selected parameter:
{'randomforestclassifier__n_estimators': 200, 'randomforestclassifier__min_samples_split': 0.2, 'randomforestclassifier__min_samples_leaf': 0.2, 'randomforestclassifier__max_leaf_nodes': 50, 'randomforestclassifier__max_features': None, 'randomforestclassifier__max_depth': 6.806451612903226, 'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__class_weight': None}
>> Best score: 0.697
>> Best parameter:
Pipeline(memory=None,
steps=[('columntransformer',
ColumnTransformer(n_jobs=None, remainder='drop',
sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with...
lowercase=True,
max_df=1.0,
max_features=5000,
min_df=1,
ngram_range=(1,
1),
preprocessor=None,
stop_words=None,
strip_accents=None,
token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None,
vocabulary=None))],
verbose=False),
['LOC_ENTITY_LIST'])],
verbose=False)),
('bernoullinb',
BernoulliNB(alpha=0.00026366508987303583, binarize=0.6,
class_prior=None, fit_prior=True))],
verbose=False)
>> Best selected parameter:
{'bernoullinb__fit_prior': True, 'bernoullinb__binarize': 0.6, 'bernoullinb__alpha': 0.00026366508987303583}
Any suggestion of what to do/test next and the explanation of why this is the case will be most appreciated.
In machine learning and deep learning, for either discriminative or generative models, we should always remember Occam's razor: "when presented with competing hypothetical answers to a problem, one should select the one that makes the fewest assumptions". This can be translated as:
Models with the least complexity (i.e. fewer hyperparameters, features, etc.) tend to perform better than those with higher complexity.
It is always recommended to start with the least complex model and then gradually move toward more elaborate ones (with more hyperparameters, more assumptions, etc.).
A Random Forest is already a very complex model, which involves a bagging ensemble of multiple Decision Trees.
My recommendation would be to try a very simple Decision Tree first and tune its hyperparameters. If you can obtain better or similar performance to your other models, then add complexity: start with a very simple Random Forest with minimal hyperparameters, such as 2 < randomforestclassifier__max_leaf_nodes < 10.
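For example, a minimal sketch of that progression (bare estimators shown for brevity; X_train and y_train are assumed to exist, and the grid values are illustrative):
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Step 1: tune a single, deliberately simple decision tree
tree_search = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={'max_depth': [2, 3, 4, 5], 'max_leaf_nodes': [4, 8, 16]},
    cv=5)
tree_search.fit(X_train, y_train)

# Step 2: only then add complexity, e.g. a small random forest
forest_search = GridSearchCV(
    RandomForestClassifier(n_estimators=100),
    param_grid={'max_leaf_nodes': list(range(3, 10))},  # 2 < max_leaf_nodes < 10
    cv=5)
forest_search.fit(X_train, y_train)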

decision_function in example of scikit-learn

I can't understand this code. What is it doing?
clf.decision_function([[1]])
I read scikit-learn.org and I couldn't understand it.
from sklearn import svm

X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovo', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
dec = clf.decision_function([[1]])
dec.shape[1] # 4 classes: 4*3/2 = 6
6
clf.decision_function_shape = "ovr"
dec = clf.decision_function([[1]])
dec.shape[1] # 4 classes
4
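For context (a reading of the snippet above, based on the documented behavior of SVC): with decision_function_shape='ovo', one binary classifier is trained per pair of classes, so decision_function returns one column of decision values per pair; with 'ovr', the pairwise results are aggregated into one column per class. A minimal sanity check of the pair count:
# Number of one-vs-one columns for n classes is n * (n - 1) / 2
n_classes = 4
print(n_classes * (n_classes - 1) // 2)  # -> 6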

Error in mlxtend voting regressor while using random forest for feature selection

import numpy as np
import xgboost as xgb
from mlxtend.regressor import StackingRegressor
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.ensemble import GradientBoostingRegressor as GBR
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
rfr = RFR(n_estimators=500, n_jobs=cc.ncpu, random_state=0)
gbr = GBR(n_estimators=1000, random_state=0)
xgr = xgb.XGBRegressor()
mtr = RFR() # meta regressor
regressors = [rfr, gbr, xgr]
model = StackingRegressor(regressors=regressors, meta_regressor=mtr)
param_grid = {
    'fs__threshold': ['median'],
    'fs__estimator__max_features': ['log2'],
    'clf__rfr__max_features': ['auto', 'log2'],
    'clf__gbr__learning_rate': [0.05, 0.02, 0.01],
    'clf__gbr__max_depth': [4, 5, 6, 7],
    'clf__gbr__max_features': ['auto', 'log2'],
    'clf__gbr__n_estimators': [500, 1000, 2000],
    'clf__xgr__learning_rate': [0.001, 0.05, 0.1, 0.2],
    'clf__xgr__max_depth': [2, 4, 6],
    'clf__xgr__min_child_weight': [1, 3, 5],
    'clf__xgr__n_estimators': [500, 1000],
    'clf__meta-mtr__n_estimators': [750, 1500]
}
rf_feature_imp = RFR(250, n_jobs=cc.ncpu)
feat_selection = SelectFromModel(rf_feature_imp)
pipeline = Pipeline([('fs', feat_selection), ('clf', model), ])
gs = GridSearchCV(pipeline, param_grid=param_grid, verbose=1, n_jobs=-1, error_score=np.nan)
In the code above, I want to use the mlxtend stacking regressor and also use a random forest to select relevant features. However, this code is not working and I get an error:
ValueError: Invalid parameter xgr for estimator StackingRegressor(meta_regressor=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False),
regressors=[RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=500, n_jobs=5, oob_sc...eg:linear', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True, subsample=1)],
verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
How to fix this?
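The error message itself points at a way forward: the key 'xgr' in param_grid does not match any parameter name the StackingRegressor actually exposes (the exposed names come from the library's own parameter naming, not from local variable names such as xgr). A hedged way to debug, following the message's own suggestion:
# List the parameter names the pipeline really exposes, then build
# param_grid from those keys instead of the local variable names.
print(sorted(pipeline.get_params().keys()))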

Scikit - 3D feature array for SVM

I am trying to train an SVM in scikit-learn. I followed the example and tried to adjust it to my 3D feature vectors.
I tried the example from the page http://scikit-learn.org/stable/modules/svm.html and it ran through. While bug-fixing I came back to the tutorial setup and found this:
X = [[0, 0], [1, 1], [2, 2]]
y = [0, 1, 1]
clf = svm.SVC()
clf.fit(X, y)
works, while
X = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
y = [0, 1, 1]
clf = svm.SVC()
clf.fit(X, y)
fails with:
ValueError: X.shape[1] = 2 should be equal to 3, the number of features at training time
What is wrong here? It's only one additional dimension...
Thanks,
El
Running your latter code works for me:
>>> X = [[0,0,0], [1,1,1], [2,2,2]]
>>> y = [0,1,1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, shrinking=True, tol=0.001,
verbose=False)
That error message seems like it should actually happen when you're calling .predict() on an SVM object with kernel="precomputed". Is that the case?
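For reference, a hedged sketch of one way this kind of error can arise: the feature counts at fit time and predict time disagree (the exact message wording varies by scikit-learn version):
from sklearn import svm

X = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]  # 3 features at training time
y = [0, 1, 1]
clf = svm.SVC()
clf.fit(X, y)
clf.predict([[0, 0]])  # 2 features at predict time -> ValueError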
