_score() missing 1 required argument 'y_true' - python-3.x

I need to run the AgglomerativeClustering method, and my code is:
model = AgglomerativeClustering()
params = {"n_clusters": [2,3,4]}
clf = GridSearchCV(model, params, n_jobs=1, cv=10, refit=False)
clf.fit(self.data, None)
Then I get the error saying "_score() missing 1 required argument 'y_true'". However, for clustering there is no y. Any solution?

As you don't specify a scoring function in GridSearchCV, it will use (based on the docs):
If None, the estimator’s score method is used.
Your estimator is model, i.e. AgglomerativeClustering, and if we check its docs, it has no score method. On top of that, there is no train/test split for this kind of algorithm. GridSearchCV is currently not designed for clustering algorithms; you can check here for how to continue.
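For example, you could sweep n_clusters yourself and pick the value that maximizes an unsupervised criterion such as the silhouette score (my choice here, not something GridSearchCV gives you; data stands in for your self.data). A minimal sketch:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

best_score, best_k = -1.0, None
for k in [2, 3, 4]:
    # fit the clusterer and score the resulting labels without any y
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(data)
    score = silhouette_score(data, labels)
    if score > best_score:
        best_score, best_k = score, k
print(best_k, best_score)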

Related

How to cross-validate PCA in an sklearn pipeline without overfitting?

My input is time series data. I want to decompose the dataset with PCA (I don't want to do PCA on the entire dataset first because that would be overfitting) and then use feature selection on each component (fitted on a KNN Regressor model).
This is my code so far:
tscv = TimeSeriesSplit(n_splits=10)
pca = PCA(n_components=.5,svd_solver='full').fit_transform()
knn = KNeighborsRegressor(n_jobs=-1)
sfs = SequentialFeatureSelector(estimator=knn,n_features_to_select='auto',tol=.001,scoring=custom_scorer,n_jobs=-1)
pipe = Pipeline(steps=[("pca", pca), ("sfs", sfs), ("knn", knn)])
cv_score = cross_val_score(estimator=pipe,X=X,y=y,scoring=custom_scorer,cv=tscv,verbose=10)
print(np.average(cv_score),' +/- ',np.std(cv_score))
print(X.columns)
The problem is I want to make sure PCA isn't looking at the entire dataset when it computes the feature variances. I also want it to be fit-transformed, but it doesn't work, failing with the following errors:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '<bound method PCA.fit_transform of PCA(svd_solver='full')>' (type <class 'method'>) doesn't
or
TypeError: fit_transform() missing 1 required positional argument: 'X'
You should not use pca = PCA(...).fit_transform nor pca = PCA(...).fit_transform() when defining your pipeline.
Instead, you should use pca = PCA(...). The fit_transform method is automatically called within the pipeline during the model fitting (in cross_val_score).
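For reference, a corrected sketch of the pipeline (custom_scorer, X and y are assumed to be defined as in your code). Inside cross_val_score the pipeline fits PCA on each training fold only, so the components never see the held-out data:
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

tscv = TimeSeriesSplit(n_splits=10)
pca = PCA(n_components=.5, svd_solver='full')  # the estimator itself, not a bound method
knn = KNeighborsRegressor(n_jobs=-1)
sfs = SequentialFeatureSelector(estimator=knn, n_features_to_select='auto', tol=.001, scoring=custom_scorer, n_jobs=-1)
pipe = Pipeline(steps=[("pca", pca), ("sfs", sfs), ("knn", knn)])
cv_score = cross_val_score(estimator=pipe, X=X, y=y, scoring=custom_scorer, cv=tscv, verbose=10)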

Getting a Scoring Function by Name in scikit-learn

In scikit-learn, there is the notion of a scoring function. If we have some predicted labels and the true labels, we can get the score by calling scoring(y_true, y_predict). An example of such a scoring function is sklearn.metrics.accuracy_score.
A scoring function is not to be confused with a scorer, which is an object that can be called as scorer(estimator, X, y_true).
There are many built-in scorers in scikit-learn. It is possible to get these scorers by their string names. For example, we can get the scorer corresponding to the name 'accuracy' by calling sklearn.metrics.get_scorer("accuracy").
But it turns out that there is no obvious mechanism to access the built-in scoring functions by their names at run time by passing the name in as a string. For example, there is no obvious way to access sklearn.metrics.accuracy_score by its name 'accuracy'.
For example, if at run time the program has the name of the scoring function in a variable name, I am looking for a mechanism get_scoring_function() such that get_scoring_function(name) returns the scoring function handle. Note that name is not known at scripting time.
Is there any way to access the built-in scoring functions by their names at run time through passing in the names as strings?
You can use the get_scorer() function, which accepts a string as an argument, and then get the _score_func attribute of the returned object.
So for example
from sklearn.metrics import get_scorer
get_scorer('accuracy')._score_func(y_true, y_pred)
is equivalent to
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
I faced this task myself, and I haven't found a better way to access metrics by name than the sklearn.metrics.get_scorer function, but its drawback is that you have to pass an estimator to it, not predictions. I tried to use @collinb9's recommendation, but you have to access a protected method there, and in my case it led to unpleasant consequences, namely incorrectly calculated metrics.
This is a short example showing this problem.
from sklearn import datasets, model_selection, linear_model, metrics
features, labels = datasets.make_regression(1000, random_state=123)
train_features, test_features, train_labels, test_labels = model_selection.train_test_split(features, labels, test_size=0.1, random_state=567)
model = linear_model.LinearRegression()
model.fit(train_features, train_labels)
print(f'variant 1 neg_mse = {metrics.get_scorer("neg_mean_squared_error")(model, test_features, test_labels)}')
print(f'variant 1 neg_rmse = {metrics.get_scorer("neg_root_mean_squared_error")(model, test_features, test_labels)}\n')
preds = model.predict(test_features)
print(f'variant 2 mse = {metrics.mean_squared_error(test_labels, preds)}')
print(f'variant 2 rmse = {metrics.mean_squared_error(test_labels, preds, squared=False)}\n')
print(f'protected neg_mse = {metrics.get_scorer("neg_mean_squared_error")._score_func(test_labels, preds)}')
print(f'protected neg_rmse = {metrics.get_scorer("neg_root_mean_squared_error")._score_func(test_labels, preds)}')
The output of this program will be:
variant 1 neg_mse = -2.142587870436064e-25
variant 1 neg_rmse = -4.628809642268803e-13
variant 2 mse = 2.142587870436064e-25
variant 2 rmse = 4.628809642268803e-13
protected neg_mse = 2.142587870436064e-25
protected neg_rmse = 2.142587870436064e-25
You can see that the metrics calculated via the protected method differ. First, we asked for negative values but got positive ones (note that for the variant 2 metrics we didn't expect negative values). Second, the neg_mse and neg_rmse values are equal when they should differ.
If we go to the source code of sklearn metrics, we will see:
This is how _score_func is called: it is multiplied by a sign factor, so that's where we lose the negative values.
This is how the scorers are constructed: neg_root_mean_squared_error_scorer is created with the extra parameter squared=False. That parameter is optional in metrics.mean_squared_error, so nothing warns you when you omit it. We can pass it as a keyword argument to _score_func and at least get the correct absolute value:
print(f'protected neg_rmse = {metrics.get_scorer("neg_root_mean_squared_error")._score_func(test_labels, preds, squared=False)}')
protected neg_rmse = 4.628809642268803e-13
In short, I've shown what is, to my knowledge, the only way to get sklearn metrics by name (by the way, you can find the full list of names here), and that it's not safe to use protected methods you're not supposed to use. I was using sklearn version 0.24.2.
Since the documentation is incomplete, you'll have to go directly to the source code here for the complete list of metric names:
Metric Names
Search for __all__.
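As a side note, if your scikit-learn version is 1.0 or newer, the valid scorer names can also be listed programmatically; a small sketch:
from sklearn.metrics import get_scorer_names
print(get_scorer_names())  # list of all built-in scorer name strings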
The answer of @collinb9 should not be accepted as-is, as it would lead to incorrect calculations.
You need the other arguments (such as squared=False for RMSE) to compute the right thing. They can be accessed via the _kwargs attribute of the _BaseScorer class. If you combine _score_func, _sign, and _kwargs, you get the corresponding scoring function.
The full answer to the question should be:
import sklearn.metrics

def score(scoring_name, y_true, y_pred):
    # look up the scorer object by name, then apply its sign and extra kwargs
    sklearn_scorer = sklearn.metrics.get_scorer(scoring_name)
    return sklearn_scorer._sign * sklearn_scorer._score_func(
        y_true=y_true, y_pred=y_pred, **sklearn_scorer._kwargs
    )

score("neg_root_mean_squared_error", y_true, y_pred)

Gaussian Process Regressor scikit learn does not recognise 'eval_MSE=True'

I am using scikit-learn's Gaussian Process Regressor to predict data for my models. While using gp, I also need to find the uncertainty of each value present in the dataset. The documentation suggests using "gp.predict(self, X, eval_MSE=True)". I used the same 'eval_MSE' in code available online to test, but it gives me this error.
TypeError: predict() got an unexpected keyword argument 'eval_MSE'
The code I used for testing:
gp = GaussianProcessRegressor(corr='squared_exponential', theta0=1e-1,
thetaL=1e-3, thetaU=1,
nugget=(dy / y) ** 2,
random_start=100)
gp.fit(X, y)
y_pred, MSE = gp.predict(x, eval_MSE=True)
sigma = np.sqrt(MSE)
Can anyone provide a solution for this?
Either go back to a previous version of scikit-learn: GaussianProcess.predict
... or adapt to the newest: GaussianProcessClassifier.predict
Not only have the predict arguments changed, but also the name of the class itself, the constructor arguments, etc.
Summary from previous links:
Old GaussianProcess (version 0.17):
class sklearn.gaussian_process.GaussianProcess(regr='constant', corr='squared_exponential', beta0=None, storage_mode='full', verbose=False, theta0=0.1, thetaL=None, thetaU=None, optimizer='fmin_cobyla', random_start=1, normalize=True, nugget=2.2204460492503131e-15, random_state=None)
predict(X, eval_MSE=False, batch_size=None)
New GaussianProcessClassifier:
class sklearn.gaussian_process.GaussianProcessClassifier(kernel=None, *, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, max_iter_predict=100, warm_start=False, copy_X_train=True, random_state=None, multi_class='one_vs_rest', n_jobs=None)
predict(X)
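Since your code is doing regression, the closer modern equivalent is GaussianProcessRegressor, whose predict accepts return_std=True and returns a predictive standard deviation in place of the old eval_MSE output. A minimal sketch, with the kernel choice being only an assumption for your data:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# squared-exponential (RBF) kernel plus a noise term, roughly playing the role of the old nugget
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X, y)
y_pred, sigma = gp.predict(x, return_std=True)  # sigma is the std dev, i.e. roughly sqrt of the old MSE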

How to use cross_val_predict to predict probabilities for a new dataset?

I am using sklearn's cross_val_predict for training like so:
myprobs_train = cross_val_predict(LogisticRegression(),X = x_old, y=y_old, method='predict_proba', cv=10)
I am happy with the returned probabilities, and would like now to score up a brand-new dataset. I tried:
myprobs_test = cross_val_predict(LogisticRegression(), X =x_new, y= None, method='predict_proba',cv=10)
but this did not work; it complains about y having zero shape. Does that mean there's no way to apply the trained, cross-validated model from cross_val_predict to new data? Or am I just using it wrong?
Thank you!
You are looking at the wrong method. Cross-validation methods do not return a trained model; they return values that evaluate the performance of a model (logistic regression in your case). Your goal is to fit some data and then generate predictions for new data. The relevant methods are fit and predict of the LogisticRegression class. Here is the basic structure:
logreg = linear_model.LogisticRegression()
logreg.fit(x_old, y_old)
predictions = logreg.predict(x_new)
I have the same concern as @user3490622. If we can only use cross_val_predict on training and testing sets, why is y (target) None by default? (sklearn page)
To partially achieve the desired result of multiple predicted probabilities, one could repeatedly apply the fit-then-predict approach to mimic cross-validation, as in the sketch below.
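A minimal sketch of that idea, fitting one model per fold and averaging the predicted probabilities on the new data (the fold count is an assumption, and x_old/y_old are assumed to be NumPy arrays):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

probs = []
for train_idx, _ in KFold(n_splits=10, shuffle=True, random_state=0).split(x_old):
    # fit on each fold's training portion, then score the brand-new data
    model = LogisticRegression().fit(x_old[train_idx], y_old[train_idx])
    probs.append(model.predict_proba(x_new))
myprobs_test = np.mean(probs, axis=0)  # average probabilities across the 10 fold-models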

GridSearchCV with RandomForest

So I am doing some parameter tuning with RandomForest and GridSearchCV. Here is my code.
#Import 'GridSearchCV' and 'make_scorer'
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
#Create the parameters list you wish to tune
parameters = {'n_estimators':[5,10,15]}
#Initialize the classifier
clf = GridSearchCV(RandomForestClassifier(), parameters)
#Make an f1 scoring function using 'make_scorer'
f1_scorer = make_scorer(f1_scorer)
#Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, param_grid=parameters, scoring=f1_scorer,cv=5)
print(clf.get_params().keys())
#Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train_100,y_train_100)
So the issue is the following error: "ValueError: Invalid parameter max_features for estimator GridSearchCV. Check the list of available parameters with estimator.get_params().keys()."
I followed the advice given by the error; the output of print(clf.get_params().keys()) is below. However, even when I copy and paste these keys into my parameter dictionary, I still get an error. I've hunted around Stack Overflow and most people are using parameter dictionaries really similar to mine. Anyone have any idea how to iron out this issue? Thanks again!
dict_keys(['pre_dispatch', 'cv', 'estimator__max_features', 'param_grid', 'refit', 'estimator__min_impurity_split', 'n_jobs', 'estimator__random_state', 'error_score', 'verbose', 'estimator__min_samples_split', 'estimator__n_jobs', 'fit_params', 'estimator__min_weight_fraction_leaf', 'scoring', 'estimator__warm_start', 'estimator__criterion', 'estimator__verbose', 'estimator__bootstrap', 'estimator__class_weight', 'estimator__oob_score', 'iid', 'estimator', 'estimator__max_depth', 'estimator__max_leaf_nodes', 'estimator__min_samples_leaf', 'estimator__n_estimators', 'return_train_score'])
I think the problem is with the two lines:
clf = GridSearchCV(RandomForestClassifier(), parameters)
grid_obj = GridSearchCV(clf, param_grid=parameters, scoring=f1_scorer,cv=5)
What this is essentially doing is creating an object with a structure like:
grid_obj = GridSearchCV(GridSearchCV(RandomForestClassifier()))
which is probably one more GridSearchCV than you want.
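A minimal corrected sketch with a single GridSearchCV; note also that make_scorer needs to wrap the metric function f1_score itself, not the scorer variable (add an average= argument to f1_score if your problem is multiclass):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score

parameters = {'n_estimators': [5, 10, 15]}
f1_scorer = make_scorer(f1_score)
# one GridSearchCV around the plain RandomForestClassifier
grid_obj = GridSearchCV(RandomForestClassifier(), param_grid=parameters, scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X_train_100, y_train_100)
print(grid_obj.best_params_)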
