decision_function in example of scikit-learn

I can't understand this code. What does it do?
clf.decision_function([[1]])
I read scikit-learn.org and I couldn't understand it.
from sklearn import svm
X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovo', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
dec = clf.decision_function([[1]])
dec.shape[1] # 4 classes: 4*3/2 = 6
6
clf.decision_function_shape = "ovr"
dec = clf.decision_function([[1]])
dec.shape[1] # 4 classes
4
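A minimal sketch of what the snippet above is checking. `decision_function` returns raw scores rather than predicted labels: with `decision_function_shape='ovo'` there is one column per pair of classes, so 4 classes give 4*3/2 = 6 columns, and with `'ovr'` there is one column per class. The variable names `dec_ovo`, `dec_ovr` and `n_classes` are only for illustration.

```python
from sklearn import svm

# Same toy data as in the question: four samples, four classes.
X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]

clf = svm.SVC(gamma="scale", decision_function_shape="ovo")
clf.fit(X, Y)
n_classes = len(clf.classes_)

# One-vs-one: one score per pair of classes.
dec_ovo = clf.decision_function([[1]])
print(dec_ovo.shape)  # (1, 6)

# One-vs-rest: one score per class.
clf.decision_function_shape = "ovr"
dec_ovr = clf.decision_function([[1]])
print(dec_ovr.shape)  # (1, 4)
```

The first axis is the number of samples passed in (one here); only the second axis changes between the two modes.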

Related

BlockingTimeSeriesSplit returns best score nan

I've used TimeSeriesSplit from sklearn and a customized BlockingTimeSeriesSplit with a GridSearchCV object to tune an XGB model (see the example in the reference link at the end of this question):
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
import numpy as np
class BlockingTimeSeriesSplit():
    def __init__(self, n_splits):
        self.n_splits = n_splits
    def get_n_splits(self, X, y, groups):
        return self.n_splits
    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        k_fold_size = n_samples // self.n_splits
        indices = np.arange(n_samples)
        margin = 0
        for i in range(self.n_splits):
            start = i * k_fold_size
            stop = start + k_fold_size
            mid = int(0.8 * (stop - start)) + start
            yield indices[start: mid], indices[mid + margin: stop]
for k_fold in range(2, 4, 1):
    print(k_fold)
    print('*'*50)
    sklearn_tscv = BlockingTimeSeriesSplit(n_splits=k_fold)
    blocked_tscv = TimeSeriesSplit(n_splits=k_fold)
    split_methods = [sklearn_tscv, blocked_tscv]
    for split_method in split_methods:
        X = np.array([[4, 5, 6, 1, 0, 2, 7, 3, 5], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1, 2, 4, 6]]).T
        y = np.array([1, 6, 7, 1, 2, 3, 1, 3, 7])
        model = xgb.XGBRegressor()
        param_search = {'max_depth' : [3, 5]}
        tscv = TimeSeriesSplit(n_splits=2)
        gsearch = GridSearchCV(estimator=model, cv=split_method,
                               param_grid=param_search)
        gsearch.fit(X, y)
        print(gsearch.fit(X, y))
        print('Best score reached: {} with params: {} '.format(gsearch.best_score_, gsearch.best_params_))
Out:
2
**************************************************
GridSearchCV(cv=<__main__.BlockingTimeSeriesSplit object at 0x0000016303A6A070>,
estimator=XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain',
interaction_constraints=None,
learning_rate=None, max_delta_step=None,
max_depth=None, min_child_weight=None,
missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None, random_state=None,
reg_alpha=None, reg_lambda=None,
scale_pos_weight=None, subsample=None,
tree_method=None, validate_parameters=None,
verbosity=None),
param_grid={'max_depth': [3, 5]})
Best score reached: nan with params: {'max_depth': 3}
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=2, test_size=None),
estimator=XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain',
interaction_constraints=None,
learning_rate=None, max_delta_step=None,
max_depth=None, min_child_weight=None,
missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None, random_state=None,
reg_alpha=None, reg_lambda=None,
scale_pos_weight=None, subsample=None,
tree_method=None, validate_parameters=None,
verbosity=None),
param_grid={'max_depth': [3, 5]})
Best score reached: -1.1467339258845848 with params: {'max_depth': 3}
3
**************************************************
GridSearchCV(cv=<__main__.BlockingTimeSeriesSplit object at 0x0000016303AF39A0>,
estimator=XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain',
interaction_constraints=None,
learning_rate=None, max_delta_step=None,
max_depth=None, min_child_weight=None,
missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None, random_state=None,
reg_alpha=None, reg_lambda=None,
scale_pos_weight=None, subsample=None,
tree_method=None, validate_parameters=None,
verbosity=None),
param_grid={'max_depth': [3, 5]})
Best score reached: nan with params: {'max_depth': 3}
GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None),
estimator=XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain',
interaction_constraints=None,
learning_rate=None, max_delta_step=None,
max_depth=None, min_child_weight=None,
missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None, random_state=None,
reg_alpha=None, reg_lambda=None,
scale_pos_weight=None, subsample=None,
tree_method=None, validate_parameters=None,
verbosity=None),
param_grid={'max_depth': [3, 5]})
Best score reached: -6.440679569353013 with params: {'max_depth': 3}
But as you can see, BlockingTimeSeriesSplit always returns Best score reached: nan with params: {'max_depth': 3} .
I would like to ask if this result is because the dataset is too small or something else. Thanks.
Reference:
https://hub.packtpub.com/cross-validation-strategies-for-time-series-forecasting-tutorial/
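One thing that may be worth checking is how large the test folds actually are. The sketch below reuses the splitting logic from the question on an array with 9 rows and prints the size of each test fold. It assumes, without having run the original XGBoost code, that the search falls back to the regressor's default R² score, which scikit-learn reports as nan when a fold contains fewer than two samples. The names `test_sizes` and `single_sample_r2` are only for illustration.

```python
import warnings

import numpy as np
from sklearn.metrics import r2_score


class BlockingTimeSeriesSplit:
    """Same splitting logic as in the question."""

    def __init__(self, n_splits):
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        k_fold_size = n_samples // self.n_splits
        indices = np.arange(n_samples)
        for i in range(self.n_splits):
            start = i * k_fold_size
            stop = start + k_fold_size
            mid = int(0.8 * (stop - start)) + start
            yield indices[start:mid], indices[mid:stop]


X = np.zeros((9, 2))  # only the number of rows (9) matters here

test_sizes = {}
for n_splits in (2, 3):
    splitter = BlockingTimeSeriesSplit(n_splits)
    test_sizes[n_splits] = [len(test) for _, test in splitter.split(X)]
    print(n_splits, test_sizes[n_splits])

# R^2 on a single sample is undefined; scikit-learn warns and returns nan.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    single_sample_r2 = r2_score([1.0], [2.0])
print(single_sample_r2)
```

If that assumption holds, every test fold here has a single sample, so each fold score is nan and so is the mean. More data, a larger test share than 20%, or a scorer such as `scoring='neg_mean_squared_error'` should avoid it.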

Check the list of available parameters with `estimator.get_params().keys()`

When I try to run a RandomForestClassifier with Pipeline and param_grid:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("rf", RandomForestClassifier())])
param_grid = {
    'max_depth': [4, 5, 10],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'n_estimators': [100, 200, 300]
}
# initialize
grid_pipeline = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1, cv=3, scoring='f1')
# fit
grid_pipeline.fit(X_train, y_train)
grid_pipeline.best_params_
I get the following error:
ValueError: Invalid parameter max_depth for estimator Pipeline(memory=None,
steps=[('scaler',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('rf',
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
class_weight=None, criterion='gini',
max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None,
oob_score=False, random_state=None,
verbose=0, warm_start=False))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Although I have reviewed the scikit learn documentation and several posts, I can't find the error in my code.
When you use a pipeline with GridSearchCV(), each key in the parameter grid must be prefixed with the name of the pipeline step it belongs to, separated from the parameter name by a double underscore. In your case the step is named "rf", so:
param_grid = {
    'rf__max_depth': [4, 5, 10],
    'rf__max_features': [2, 3],
    'rf__min_samples_leaf': [3, 4, 5],
    'rf__n_estimators': [100, 200, 300]
}
Example from sklearn documentation: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
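A small self-contained sketch of the same idea. The synthetic data from `make_classification` and the reduced grid are assumptions made only so the example runs quickly. It lists the keys the pipeline accepts, runs the search with prefixed keys, and shows that an unprefixed key is rejected by `set_params`, which as far as I know is what GridSearchCV calls internally for each candidate.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only).
X_demo, y_demo = make_classification(n_samples=60, n_features=4, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("rf", RandomForestClassifier(random_state=0))])

# Every tunable key of the "rf" step carries the "rf__" prefix.
rf_keys = sorted(k for k in pipeline.get_params() if k.startswith("rf__"))
print(rf_keys)

# Prefixed keys are accepted by the search.
param_grid = {"rf__max_depth": [2, 4], "rf__n_estimators": [10, 20]}
grid = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
grid.fit(X_demo, y_demo)
print(grid.best_params_)

# An unprefixed key is not a parameter of the Pipeline itself.
try:
    clone(pipeline).set_params(max_depth=4)
    rejected = False
except ValueError as exc:
    rejected = True
    print(type(exc).__name__)
```

Printing `pipeline.get_params().keys()` like this is the quickest way to see which names a grid may use.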

Why does SVC repeat its parameters in the output?

My aim is to classify the extracted features from CNN using support vector machine.
Extracted features have a shape (2186, 128), which is an np array saved in X_tr.
Y has the shape (2186,) an array([0, 0, 0, ..., 0, 0, 0])
Applying these to SVC.
Input:
from sklearn.svm import SVC
clf = SVC()
clf.fit(X_train, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Output:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Why is it giving parameters as output instead of classification?
What are you expecting to see? You have trained the classifier on your training data and now you need to evaluate the classifier on your test data. In scikit-learn, you train a classifier using:
clf.fit(X_train, y_train)
and you make a prediction with the trained classifier using something like:
predictions = clf.predict(X_test)
from sklearn.svm import SVC
clf = SVC()
clf.fit(X_train, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
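won't give any classification output: fit() returns the fitted estimator itself, so the interpreter just prints its parameters.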
If you want to classify and see the predictions, use:
from sklearn.svm import SVC
clf = SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
clf.fit(X_train, y)
pred = clf.predict(X_test)
print(pred)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
C, cache_size, class_weight, etc. are the parameters that SVC accepts. You can use them for tuning, for example to choose a 'linear' or 'rbf' kernel, or to set C to 1000.
For more info please check: http://scikit-learn.org/stable/modules/svm.html
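To make the difference concrete, here is a minimal sketch with random stand-in data (the arrays below are assumptions, not the CNN features from the question). It shows that `fit` hands back the estimator object, which is why an interactive session prints its parameters, while `predict` is what returns class labels.

```python
import numpy as np
from sklearn.svm import SVC

# Random stand-in data: 40 training rows and 10 test rows with 5 features each.
rng = np.random.RandomState(0)
X_train = rng.normal(size=(40, 5))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(10, 5))

clf = SVC()
returned = clf.fit(X_train, y_train)
print(returned is clf)  # True: fit() returns the estimator, not predictions

predictions = clf.predict(X_test)
print(predictions)  # one predicted label per test row
```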

Error in mlxtend voting regressor while using random forest for feature selection

import numpy as np
import xgboost as xgb
from mlxtend.regressor import StackingRegressor
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.ensemble import GradientBoostingRegressor as GBR
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
rfr = RFR(n_estimators=500, n_jobs=cc.ncpu, random_state=0)
gbr = GBR(n_estimators=1000, random_state=0)
xgr = xgb.XGBRegressor()
mtr = RFR() # meta regressor
regressors = [rfr, gbr, xgr]
model = StackingRegressor(regressors=regressors, meta_regressor=mtr)
param_grid = {
    'fs__threshold': ['median'],
    'fs__estimator__max_features': ['log2'],
    'clf__rfr__max_features': ['auto', 'log2'],
    'clf__gbr__learning_rate': [0.05, 0.02, 0.01],
    'clf__gbr__max_depth': [4, 5, 6, 7],
    'clf__gbr__max_features': ['auto', 'log2'],
    'clf__gbr__n_estimators': [500, 1000, 2000],
    'clf__xgr__learning_rate': [0.001, 0.05, 0.1, 0.2],
    'clf__xgr__max_depth': [2, 4, 6],
    'clf__xgr__min_child_weight': [1, 3, 5],
    'clf__xgr__n_estimators': [500, 1000],
    'clf__meta-mtr__n_estimators': [750, 1500]
}
rf_feature_imp = RFR(250, n_jobs=cc.ncpu)
feat_selection = SelectFromModel(rf_feature_imp)
pipeline = Pipeline([('fs', feat_selection), ('clf', model), ])
gs = GridSearchCV(pipeline, param_grid=param_grid, verbose=1, n_jobs=-1, error_score=np.nan)
In the code above, I want to use the mlxtend voting regressor and also use a random forest to select relevant features. However, this code is not working and I get this error:
ValueError: Invalid parameter xgr for estimator StackingRegressor(meta_regressor=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False),
regressors=[RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=500, n_jobs=5, oob_sc...eg:linear', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True, subsample=1)],
verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
How to fix this?
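The error message itself points at the way to debug this: grid keys have to match the names the estimator reports, not the Python variable names (`rfr`, `gbr`, `xgr`, `mtr`). The sketch below is not the mlxtend code from the question. It swaps in scikit-learn's own `StackingRegressor`, where the names are given explicitly as `(name, estimator)` pairs, drops XGBoost to stay dependency-free, and uses small `n_estimators` values chosen only for illustration. It then checks a few candidate keys against `pipeline.get_params()`.

```python
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

# scikit-learn's StackingRegressor (a stand-in for mlxtend's) takes explicit names.
stack = StackingRegressor(
    estimators=[
        ("rfr", RandomForestRegressor(n_estimators=10, random_state=0)),
        ("gbr", GradientBoostingRegressor(n_estimators=10, random_state=0)),
    ],
    final_estimator=RandomForestRegressor(n_estimators=10, random_state=0),
)
pipeline = Pipeline([
    ("fs", SelectFromModel(RandomForestRegressor(n_estimators=10, random_state=0))),
    ("clf", stack),
])

# The keys a parameter grid is allowed to use.
valid_keys = set(pipeline.get_params().keys())

candidate_keys = [
    "fs__threshold",
    "fs__estimator__max_features",
    "clf__rfr__max_features",
    "clf__gbr__learning_rate",
    "clf__final_estimator__n_estimators",
    "clf__xgr__max_depth",           # no estimator named "xgr" in this stand-in
    "clf__meta-mtr__n_estimators",   # not a name this estimator reports
]
for key in candidate_keys:
    print(key, key in valid_keys)
```

With mlxtend the names are, as far as I know, derived by the library rather than taken from your variables, so the same check on `pipeline.get_params().keys()` should show which keys the grid in the question needs.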

Scikit - 3D feature array for SVM

I am trying to train an SVM in scikit. I am following the example and tried to adjust it to my 3d feature vectors.
I tried the example from the page http://scikit-learn.org/stable/modules/svm.html
and it ran through. While bugfixing I came back to the tutorial setup and found this:
from sklearn import svm
X = [[0, 0], [1, 1], [2, 2]]
y = [0, 1, 1]
clf = svm.SVC()
clf.fit(X, y)
works while
X = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
y = [0, 1, 1]
clf = svm.SVC()
clf.fit(X, y)
fails with:
ValueError: X.shape[1] = 2 should be equal to 3, the number of features at training time
What is wrong here? It's only one additional dimension...
Thanks,
El
Running your latter code works for me:
>>> X = [[0,0,0], [1,1,1], [2,2,2]]
>>> y = [0,1,1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, shrinking=True, tol=0.001,
verbose=False)
That error message seems like it should actually happen when you're calling .predict() on an SVM object with kernel="precomputed". Is that the case?
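For what it's worth, here is one way an error of this kind can be reproduced. It assumes the mismatch comes from calling `predict` with a different number of features than was used in `fit`, for example a leftover 2-feature test sample from the tutorial. The exact wording of the message differs between scikit-learn versions, so the sketch only prints whatever is raised.

```python
from sklearn import svm

# Train on samples with three features each.
X = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
y = [0, 1, 1]
clf = svm.SVC()
clf.fit(X, y)

# Predicting with three features per sample works.
pred_ok = clf.predict([[2, 2, 2]])
print(pred_ok)

# Predicting with two features per sample does not match the training shape.
try:
    clf.predict([[2, 2]])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
print(mismatch_error)
```

So the `fit` call in the question is fine on its own; the thing to check is the shape of whatever is passed to `predict` afterwards.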
