How do we get 3 regression predictions in one pipeline? - scikit-learn

I am using scikit-learn's QuantileRegressor with the quantile values [0.5, 0.1, 0.9] to get the predicted value together with the upper and lower confidence levels of the prediction. I am also using TransformedTargetRegressor to take care of the target transformation. To get the 3 prediction values in the Pipeline, I attempted to use FeatureUnion for parallel estimations:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('estimator', FeatureUnion([
        ('regressor', TransformedTargetRegressor(regressor=QuantileRegressor(quantile=0.50), func=log_func, inverse_func=exp_func, check_inverse=True)),
        ('regressor_lower', TransformedTargetRegressor(regressor=QuantileRegressor(quantile=0.10), func=log_func, inverse_func=exp_func, check_inverse=True)),
        ('regressor_upper', TransformedTargetRegressor(regressor=QuantileRegressor(quantile=0.90), func=log_func, inverse_func=exp_func, check_inverse=True))
    ]))
])
I get the error "All estimators should implement fit and transform". What is the alternative to get the predictions of the three regressors?
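FeatureUnion raises that error because it only accepts transformers, i.e. objects with fit and transform, while TransformedTargetRegressor only exposes fit and predict. One possible workaround, sketched below purely as an illustration (it assumes np.log and np.exp in place of the original log_func and exp_func), is a thin wrapper whose transform returns the regressor's predictions, so that the three quantile models can sit side by side inside the FeatureUnion:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import QuantileRegressor
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class PredictionTransformer(BaseEstimator, TransformerMixin):
    """Wrap a regressor so that transform() returns its predictions as one column."""

    def __init__(self, regressor):
        self.regressor = regressor

    def fit(self, X, y=None):
        self.regressor_ = clone(self.regressor).fit(X, y)
        return self

    def transform(self, X):
        return self.regressor_.predict(X).reshape(-1, 1)


def make_quantile_model(q):
    # np.log / np.exp stand in for the original log_func / exp_func
    return PredictionTransformer(
        TransformedTargetRegressor(
            regressor=QuantileRegressor(quantile=q),
            func=np.log, inverse_func=np.exp, check_inverse=True,
        )
    )


pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('estimator', FeatureUnion([
        ('regressor', make_quantile_model(0.50)),
        ('regressor_lower', make_quantile_model(0.10)),
        ('regressor_upper', make_quantile_model(0.90)),
    ])),
])
# After pipeline.fit(X, y), pipeline.transform(X) returns three columns:
# the median, lower-quantile and upper-quantile predictions.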

Related

F1 metric and LeaveOneOut validation strategy in scikit-learn

I want to use GridSearchCV to find the optimal n_neighbors parameter of KNeighborsClassifier.
I want to use the 'f1_score' metric AND the 'leave one out' strategy.
But this code
clf = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 2, 3]}, cv=LeaveOneOut(), scoring='f1')
clf.fit(x_train, y_train)
leads to an error
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.
I want to compute the f1 score not on each fold of the cross-validation (it is not possible to compute an f1 score from a single test example), but on the whole set of LeaveOneOut predictions for each value of n_neighbors.
Is it possible using GridSearchCV?
Not sure if this functionality is directly available in scikit-learn, but you can implement the following to get the desired outcome.
In particular, we make a dummy scorer which just returns the predicted class instead of computing any score from the ground truth and the prediction. This way we can access the predictions of each hyperparameter combination on the different examples in the LOO CV.
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def get_pred(y_true, y_predicted):
    return y_predicted

get_pred_scorer = make_scorer(get_pred)

clf = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [1, 2, 3]},
    cv=LeaveOneOut(),
    refit=False,
    scoring=get_pred_scorer
)
clf.fit(X_train, y_train)
The problem with this approach is that certain results available in the cv_results_ dictionary (and in certain attributes of GridSearchCV) won't have any meaning, but that is probably not a problem. We should just remember to set refit=False, since this scorer gives GridSearchCV no way to determine the best model.
Now we can access the predictions through cv_results_ and simply use f1_score to compute the metric for each hyperparameter configuration.
def print_params_f1_scores(clf, y_true):
    y_preds = []  # will contain the predictions of each params combination
    results = clf.cv_results_
    params = results["params"]  # all params combinations
    for j in range(len(params)):  # for each combination
        y_preds.append([])
        for i in range(clf.n_splits_):  # for each split (sample in loo)
            prediction_of_j_on_i = results[f"split{i}_test_score"][j]
            y_preds[j].append(prediction_of_j_on_i)
    # show the f1-scores of each combination
    for j in range(len(y_preds)):
        score = f1_score(y_true, y_preds[j])
        print(f"KNeighborsClassifier with {params[j]} obtained f1-score of {score}")

print_params_f1_scores(clf, y_train)
The function prints the following output:
KNeighborsClassifier with {'n_neighbors': 1} obtained f1-score of 0.94
KNeighborsClassifier with {'n_neighbors': 2} obtained f1-score of 0.94
KNeighborsClassifier with {'n_neighbors': 3} obtained f1-score of 0.92

predict_proba() Logistic Regression when predicting a single value

I want to use Logistic Regression to predict a class (-1 or +1) given a data set which I split as follows (only a single entry is to be predicted in the test set):
x_train, x_test = loc_indep[:-1], loc_indep[-1:]
y_train, y_test = loc_target[:-1], loc_target[-1:]
Then I use the following to train the model:
from sklearn.linear_model import LogisticRegression

regr = LogisticRegression()
regr.fit(x_train, y_train)
predictions = regr.predict(x_test)
probabilities = regr.predict_proba(x_test)
print(probabilities)  # prints the class probabilities for the single test row
Given the above, the printed probabilities are always either [1. 0.] or [0. 1.], meaning that either class +1 or class -1 is picked with a probability of 100%. Why is this the case? I expected the probabilities to sum to 1, but with the model picking, say, class +1 with a probability of 54%.
Your code seems to be correct, so this means you have a suspiciously accurate model (which makes me suspect that something is wrong...). I recommend checking your training data; maybe, by mistake, you have a variable that explains too much (for example the target itself leaking into the features).
Also try to output the train and test accuracy, as sketched below. If the train accuracy is 100% and the test accuracy is much lower, you are overfitting, and you will have to change some hyperparameters to avoid it.
To conclude, try to understand your data; maybe it really is easy to separate the two classes, and that is why you obtain such a good model.
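A minimal sketch of that check, reusing regr, x_train/x_test and y_train/y_test from the question (note that with a single-row test set the test accuracy can only be 0 or 1, so a larger held-out set, e.g. from train_test_split, gives a more meaningful comparison):

# Compare train and test accuracy to spot overfitting or target leakage.
train_acc = regr.score(x_train, y_train)
test_acc = regr.score(x_test, y_test)  # only one test sample here, so this is 0 or 1
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")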

What is the meaning of scikit-learn GridSearchCV best_score_ when GridSearchCV is used with KerasRegressor?

I use scikit-learn's GridSearchCV to grid-search hyperparameters for my Keras neural network (for a regression problem). The output of my neural network is a real value:
#generate a model (createModel is a function which returns a keras.Sequential model)
model = keras.wrappers.scikit_learn.KerasRegressor(build_fn=createModel)
#run the GridSearch
paramGrid = dict( epochs=[100, 250, 500], batch_size=[16, 32, 64] )
grid = sklearn.model_selection.GridSearchCV(estimator=model, param_grid=paramGrid, n_jobs=1, cv=5)
#obtain and print the result (X, y are some data)
grid_result = grid.fit(X, y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
I don't understand what exactly the best_score_ member in the grid search result is. Is it a gap between the theoretical values and the predicted values? This best_score_ is always negative (and quite big) in my examples; it doesn't make any sense to me.
When you don't pass a specific scoring metric, GridSearchCV will use the default score method of the estimator.
In your example you did not pass a metric to your grid search instance, so it uses the default score of KerasRegressor which, according to the source code on GitHub, is the negative of the model's loss on the given data (negated so that higher still means better). Hence, since you set cv=5, grid_result.best_score_ is that negative loss averaged over the 5 validation folds, which is why it is negative and large in magnitude whenever the loss itself is large.
I suggest you set your own performance metric by passing a value for scoring, using a regression metric since your target is a real value. For example:
grid = sklearn.model_selection.GridSearchCV(estimator=model, param_grid=paramGrid,
                                            scoring='neg_mean_squared_error', n_jobs=1, cv=5)
You can find a list of all the supported metrics in the scikit-learn documentation. You can also define your own, as sketched below.
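For instance, a hand-rolled RMSE scorer could look like the sketch below; the names rmse and rmse_scorer are only illustrative, and model/paramGrid are the ones defined above:

import numpy as np
from sklearn.metrics import make_scorer

def rmse(y_true, y_pred):
    # Root mean squared error: lower is better, hence greater_is_better=False below.
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# make_scorer negates the value when greater_is_better=False, so GridSearchCV can maximize it.
rmse_scorer = make_scorer(rmse, greater_is_better=False)

grid = sklearn.model_selection.GridSearchCV(estimator=model, param_grid=paramGrid,
                                            scoring=rmse_scorer, n_jobs=1, cv=5)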

GridSearchCV Pipeline MultiOutputClassifier with XGBClassifier - how to pass early_stopping_rounds and eval_set?

I want to do multi-output prediction of labels and continuous data. My data consists of time series: one series of 10 time points of 30 observables per sample. I want to predict 10 labels that are binary, and 5 that are continuous, based on this data.
For the sake of simplicity I have flattened the time series data - ending up with one row per sample.
Since there are many labels to predict about the same system, and since relationships exist between them, I want to use multi-output prediction to do so. My idea is to divide the task into two parts: one for multi-output classification, another for multi-output regression.
I generally like XGBoost and wish to use it for this task, but of course I want to prevent overfitting when doing so. So I have the piece of code below, and I wish to pass early_stopping_rounds to the fit method of the XGBClassifier, but don't know how to.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

pipeline = Pipeline([
    ('imputer', SimpleImputer()),  # XGBoost can deal with NaNs, but MultiOutputClassifier cannot
    ('classifier', MultiOutputClassifier(XGBClassifier()))
])

param_grid = dict(
    classifier__estimator__n_estimators=[100],  # this works
    # classifier__estimator__early_stopping_rounds=[30],  # needs to be passed to .fit
    # classifier__estimator__scale_pos_weight=[scale_pos_weight],  # XGBoostError: Invalid Parameter format for scale_pos_weight expect float
)

clf = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring='roc_auc', refit='roc_auc', cv=5, n_jobs=-1)
clf.fit(X_train, y_train[CLASSIFICATION_LABELS])

y_hat_proba = np.array(clf.predict_proba(X_test))
y_hat = pd.DataFrame(np.array([y_hat_proba[:, i, 0] for i in range(y_hat_proba.shape[1])]), columns=CLASSIFICATION_LABELS)
auc_roc_scores = np.array([roc_auc_score(y_test[label], (y_hat[label] > 0.5).astype(int)) for label in y_hat.columns])
print(f'average ROC AUC score: {np.mean(auc_roc_scores).round(3)}+/-{np.std(auc_roc_scores).round(3)}')
>>> average ROC AUC score: 0.499+/-0.002
I tried passing it to fit as follows:
classifier__estimator__early_stopping_rounds=30
classifier__early_stopping_rounds=30
I get AUC ROC scores of 0.5 on the labels, which means this clearly isn't working; hence I want to pass the early_stopping_rounds parameter and the eval_set. I suppose that being able to pass scale_pos_weight could also be useful, but it probably doesn't work for multi-output prediction. At the moment I get the feeling that this is not the way to go to solve this, and in case you agree I would appreciate alternative suggestions.
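For reference, the general mechanism for passing fit parameters through a Pipeline is the '<step>__<param>' prefix on the fit call; GridSearchCV.fit forwards such keyword arguments to its estimator. The sketch below is only illustrative, uses a single label and a plain XGBClassifier, and makes assumptions about library versions: whether MultiOutputClassifier forwards these parameters to each wrapped estimator, and whether your xgboost version still accepts early_stopping_rounds in fit() rather than in the constructor, depends on the scikit-learn and xgboost releases you use.

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# One label for simplicity; X_train, y_train and CLASSIFICATION_LABELS as in the question.
y_single = y_train[CLASSIFICATION_LABELS[0]]
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_single, test_size=0.2)

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('classifier', XGBClassifier()),
])

# Pipeline.fit routes '<step>__<param>' keyword arguments to that step's fit().
# Caveat: the eval_set is NOT passed through the imputer automatically.
pipe.fit(
    X_tr, y_tr,
    classifier__eval_set=[(X_val, y_val)],
    classifier__early_stopping_rounds=30,
)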

Error message on attempting to fit training data using GridSearch function

from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV

polyreg = PolynomialFeatures(degree=4)
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search_polyreg = GridSearchCV(polyreg, param_grid, cv=5)
grid_search_polyreg.fit(x_train, y_train)
grid_search_polyreg.score(x_test, y_test)
print("Best Parameters for polynomial regression: {}".format(grid_search_polyreg.best_params_))
print("Best Score for polynomial regression: {:.2f}".format(grid_search_polyreg.best_score_))
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator PolynomialFeatures(degree=4, include_bias=True, interaction_only=False) does not.
1) I understand that alpha is not a parameter of PolynomialFeatures, but when I tried to remove alpha and fit the data it still did not work.
2) Does that mean that I am not supposed to use grid search for getting scores of a KNN regressor, linear and kernel SVMs?
I am new to Python and any suggestion is much appreciated. Thanks in advance.
sklearn.preprocessing.PolynomialFeatures() doesn't have a scoring function. It's not actually an estimator or machine learning model; it just transforms a matrix. You can have it as part of your pipeline and tune its parameters, but you have to pass an actual estimator with a scoring function to GridSearchCV, as sketched below.
Fitting to data has a different meaning for transformers than for estimators; only in the latter case does it mean "train".
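For example, a sketch along those lines, using Ridge purely as an illustrative estimator whose alpha can be tuned (it is not from the original question), and reusing x_train/y_train from above:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

pipe = Pipeline([
    ('poly', PolynomialFeatures()),   # transformer: its parameters can still be searched
    ('ridge', Ridge()),               # estimator: provides the score method GridSearchCV needs
])

param_grid = {
    'poly__degree': [2, 3, 4],
    'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100],
}

grid_search_polyreg = GridSearchCV(pipe, param_grid, cv=5)
grid_search_polyreg.fit(x_train, y_train)
print("Best Parameters for polynomial regression: {}".format(grid_search_polyreg.best_params_))
print("Best Score for polynomial regression: {:.2f}".format(grid_search_polyreg.best_score_))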
