Storing multiple metrics in mlflow from a cross validation - mlflow

I have a logistic regression model for which I'm performing repeated k-fold cross-validation, and I'm wondering about the right way to track the produced metrics in the MLflow tracking API.
import mlflow
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import RepeatedKFold, cross_val_score

exp = mlflow.set_experiment("all_models_repeated_cross_validation_roc_auc")
with mlflow.start_run(experiment_id=exp.experiment_id):
    rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=random_state)
    scoring = make_scorer(roc_auc_score, needs_proba=False, multi_class="ovr")
    lr_scores = cross_val_score(lr, X_train, y_train, scoring=scoring, cv=rkf)
    # log all 50 metrics in the MLflow tracking API
What is the proper way to do that with mlflow? Is it storing it as an artifact?

log_metric cannot store an entire array of values. One option is indeed to store the scores as an artifact and associate it with the model; another is to log each score as a metric, using the step argument to index the value from the i-th fold.
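For example, a minimal sketch of the second option, reusing exp and lr_scores from the question (the metric and artifact names are just illustrative):

import mlflow

with mlflow.start_run(experiment_id=exp.experiment_id):
    # one metric series; `step` indexes the i-th fold, so all 50 values are kept
    for i, score in enumerate(lr_scores):
        mlflow.log_metric("cv_roc_auc", float(score), step=i)

    # optionally also keep the raw scores as an artifact plus a summary metric
    mlflow.log_dict({"cv_roc_auc_scores": [float(s) for s in lr_scores]}, "cv_scores.json")
    mlflow.log_metric("cv_roc_auc_mean", float(lr_scores.mean()))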

Related

GridSearchCV, Data Leaks & Production Process Clarity

I've read a bit about integrating scaling with cross-validation and hyperparameter tuning without risking data leaks. The most sensible solution I've found (to my knowledge) involves creating a pipeline that includes the scaler and passing it to GridSearchCV when you want to grid search and cross-validate. I've also read that, even when using cross-validation, it is useful to create a hold-out test set at the very beginning for an additional, final evaluation of your model after hyperparameter tuning. Putting that all together looks like this:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# train/test split on the unscaled data to create a final hold-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# instantiate a pipeline with scaler and model, so that within each fold the
# scaler is fit on that fold's training set only and then used to transform
# both the fold's training and validation sets, preventing data leaks
pipe = Pipeline([('sc', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# define hyperparameters to search (note the double underscore for pipeline steps)
params = {'knn__n_neighbors': [3, 5, 7, 11]}

# create the grid search
search = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      return_train_score=True)
search.fit(X_train, y_train)
Assuming my understanding and the above process are correct, my question is: what's next?
My guess is that we:
fit our scaler on X_train
transform X_train and X_test with that scaler
train a new model on X_train using the best parameters found by the grid search
test the new model on the hold-out test set we created at the very start.
Presumably, because the grid search evaluated models with scaling fit on various slices of the data, the difference between those scalings and the one fit on the full training set should be fine.
Finally, when it is time to run completely new data points through our production model, do those data points need to be transformed by the scaler fit on our original X_train?
Thank you for any help. I hope I am not completely misunderstanding fundamental aspects of this process.
Bonus Question:
I've seen example code like the above from a number of sources. How does the pipeline know to fit the scaler on each fold's training data and then transform both the training and test data? Usually we have to define that process ourselves:
from sklearn.preprocessing import MinMaxScaler

# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset only
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
GridSearchCV will help you find the best set of hyperparameters for your pipeline and dataset. To do that it uses cross-validation (splitting your training set into 5 equal subsets in your case). This means that each candidate model is trained on only 80% of the training set in every fold.
As you know, the more data a model sees, the better its results tend to be. Therefore, once you have the optimal hyperparameters, it is wise to retrain the best estimator on your whole training set and assess its performance on the test set.
You can retrain the best estimator on the whole training set by specifying the refit=True parameter of GridSearchCV, and then score the resulting best_estimator_ as follows:
search = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      return_train_score=True,
                      refit=True)
search.fit(X_train, y_train)

tuned_pipe = search.best_estimator_
tuned_pipe.score(X_test, y_test)
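As for the bonus question: Pipeline.fit fits each step on whatever data it is given (inside cross-validation, the fold's training split), while predict/score only call transform on the already-fitted steps, and GridSearchCV clones and refits the whole pipeline in every fold. A small self-contained sketch of that behaviour (using the iris data purely as a stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

pipe = Pipeline([('sc', StandardScaler()), ('knn', KNeighborsClassifier())])
pipe.fit(X_tr, y_tr)              # the scaler is fit (and applied) on X_tr only, then knn is fit
print(pipe.score(X_val, y_val))   # the scaler only transforms X_val here; it is not refit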

Getting proper cross validation scores with grid search and pipelines in sklearn

My setup: I am running a process (a pipeline) in which I run a regression after selecting the relevant variables (after standardizing the data - steps I have omitted since they are irrelevant here), and I optimize it through a grid search, as shown below:
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

fold = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=777)
regression_estimator = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10, solver='newton-cg')
pipeline_steps = [('feature_selection', SelectKBest(f_regression)), ('regression', regression_estimator)]
pipe = Pipeline(steps=pipeline_steps)
feature_selection_k_options = np.arange(1, 33, 3)
param_grid = {'feature_selection__k': feature_selection_k_options}
gs = GridSearchCV(pipe, param_grid=param_grid, scoring='recall', cv=fold)
gs.fit(X, y)
Since refit=True by default in GridSearchCV, I get the best_estimator_ and I'm fine with that. What I am missing is, given this best_estimator_, how to get the cross-validated scores on only the TEST data I split off beforehand in the procedure. There is a .score(X, y) method but, as the docs state (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba), it "Returns the mean accuracy on the given test data and labels", whereas I want what is done through cross_val_score (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html). The problem is that that procedure re-runs everything and keeps only those results (I want all the quantities that come out of this process).
In essence, I want to extract, from the best estimator, the cross-validated score on the test data with a measure of my choosing (or the one already selected in the grid search), using the cross-validation scheme already embedded in my pipeline (the StratifiedShuffleSplit in this case).
Do you know how to do it?
You can access the cross-validation scores through the cv_results_ attribute, which can be read conveniently into a pandas DataFrame:
import pandas as pd
df_result = pd.DataFrame(gs.cv_results_)
Regarding "with a measure of my choosing", you can check out this example showing how multiple scorers can be calculated at once within GridSearchCV.
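For instance, a minimal sketch of multi-metric scoring, reusing pipe, param_grid and fold from the question (the chosen metrics are just examples):

import pandas as pd
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(pipe,
                  param_grid=param_grid,
                  scoring={'recall': 'recall', 'roc_auc': 'roc_auc'},  # several scorers at once
                  refit='recall',   # the metric used to pick and refit best_estimator_
                  cv=fold)
gs.fit(X, y)

df_result = pd.DataFrame(gs.cv_results_)
# cv_results_ now holds per-split and aggregate columns for every scorer,
# e.g. 'mean_test_recall' and 'mean_test_roc_auc'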

How to use cross-validation split data with RandomizedSearchCV

I'm trying to transfer my model from single run to hyper-parameter tuning using RandomizedSearchCV.
In my single-run case, my data is split into train/validation/test sets.
When I run RandomizedSearchCV on my train_data with the default 3-fold CV, I notice that the length of my train_input is reduced to 66% of train_data (which makes sense for 3-fold CV...).
So I'm guessing that I should merge my initial train and validation sets into a larger train set and let RandomizedSearchCV split it into train and validation sets.
Would that be the right way to go?
My question is: how can I access the remaining 33% of my train_input to feed it to my validation accuracy test function (note that my score function is running on test set)?
Thanks for your help!
Yoann
I'm not sure that my code would help here since my question is rather generic.
This is the answer that I found by going through sklearn's code: RandomizedSearchCV doesn't return the split-off validation data in an easy way, and I should definitely merge my initial train and validation sets into a larger train set and let RandomizedSearchCV split it into train and validation sets.
The train_data is split for CV into a train/validation set using a cross-validator (in my case, Stratified K-Folds http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html).
My estimator is defined as follows:
class DNNClassifier(BaseEstimator, ClassifierMixin):
It needs a score function to be able to evaluate the CV performance on the validation set. There is a default score function defined in the ClassifierMixin class (which returns the mean accuracy and requires a predict function to be implemented in the estimator class).
In my case, I implemented a custom score function within my estimator class.
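For illustration, a toy sketch of that pattern (the real DNNClassifier is not shown here, so this hypothetical stand-in just predicts the majority class; the overridden score method is the only point):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import f1_score

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy stand-in for a real classifier, used only to show a custom score()."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

    def score(self, X, y):
        # overrides ClassifierMixin.score (mean accuracy); this is what the
        # search calls on each fold's validation split
        return f1_score(y, self.predict(X), average="macro")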
The hyperparameter search and CV fit is done calling the fit function of RandomizedSearchCV.
RandomizedSearchCV(DNNClassifier(), param_distribs).fit(train_data)
This fit function runs the estimator's custom fit function on the train split and then the score function on the validation split.
This is done using the _fit_and_score function from the sklearn.model_selection._validation module.
So I can access the automatically split validation set (33% of my train_data input) at the end of my estimator's fit function.
I'd have preferred to access it within my estimator's fit function so that I can use it to plot validation accuracy over training steps and for early stop (I'll keep a separate validation set for that).
I guess I could reconstruct the automatically generated validation set by looking for the missing indexes from my initial train_data (the train_data used in the estimator's fit function has 66% of the indexes of the initial train_data).
If that is something that someone has already done I'd love to hear about it!
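A rough sketch of that index-reconstruction idea (this assumes train_data is a pandas DataFrame, so the original index survives the slicing that RandomizedSearchCV performs; recover_validation_split is a hypothetical helper, not part of sklearn):

import pandas as pd

def recover_validation_split(train_data: pd.DataFrame, X_fit: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of train_data that were held out from the split passed to fit()."""
    held_out_idx = train_data.index.difference(X_fit.index)
    return train_data.loc[held_out_idx]

# intended to be called from inside (or right after) the estimator's fit(),
# where X_fit is the 66% train split received for the current fold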

How to extract average metrics with Cross-Validation in PySpark

I'm trying to perform cross-validation over a Random Forest in Spark 1.6.0 and I'm finding it hard to obtain the evaluation metrics (precision, recall, f1...). I want the average of the metrics across all folds. Is it possible to obtain them with CrossValidator and MulticlassClassificationEvaluator?
I only found examples where the evaluation is performed later on an independent test dataset using the best model from the cross-validation. I'm not planning to use a train and test set, but to use the whole dataframe (df) for the cross-validation, let it make the splits, and then take the average metrics.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = ParamGridBuilder().build()
evaluator = MulticlassClassificationEvaluator()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)
model = crossval.fit(df)
evaluator.evaluate(model.transform(df))
For now, I obtain the best model's metric with the last line of the above code, evaluator.evaluate(model.transform(df)), and I'm not totally sure that I'm doing it correctly.
In Spark 2.x, it is possible to get the average metrics using model.avgMetrics. This returns an array of doubles containing the evaluator's metric averaged over the folds, one entry per parameter map used to train your cross-validation model.
For MulticlassClassificationEvaluator, the metric defaults to f1; the other supported options are weightedPrecision, weightedRecall and accuracy (as documented here), and the metric can be changed with the evaluator's setMetricName setter.
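For example, a minimal sketch (Spark 2.x, reusing crossval and evaluator from the question):

# choose which metric the evaluator (and hence avgMetrics) reports
evaluator.setMetricName("weightedRecall")

cv_model = crossval.fit(df)

# one averaged value per entry in paramGrid (a single value here, since the
# grid contains only one, empty, parameter map)
print(cv_model.avgMetrics)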
If you also need to get the best model parameters chosen by the cross-validator, please see my answer here.

advanced feature extraction for cross-validation using sklearn

Given a dataset with 1000 samples, suppose I would like to preprocess the data in order to obtain 10000 rows, so that each original row leads to 10 new samples. In addition, when training my model I would like to be able to perform cross-validation as well.
The scoring function I have uses the original data to compute the score, so I would like cross-validation scoring to work on the original data as well, rather than on the generated data. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
add the feature extractor to a pipeline and feed it to, say, GridSearchCV
implement a custom scorer which operates on the original data to score the model given a set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection to a competition going on right now on Kaggle
Maybe you can use stratified cross-validation (e.g. Stratified K-Fold or Stratified Shuffle Split) on the expanded samples, use the original sample index as the stratification info, and combine that with a custom score function that ignores the non-original samples during model evaluation.
