advanced feature extraction for cross-validation using sklearn - scikit-learn

Given a sample dataset with 1000 samples of data, suppose I would like to preprocess the data in order to obtain 10000 rows of data, so each original row of data leads to 10 new samples. In addition, when training my model I would like to be able to perform cross validation as well.
The scoring function I have uses the original data to compute the score so I would like cross validation scoring to work on the original data as well rather than the generated one. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
1. Create a custom feature extractor to extract features to feed to the classifier.
2. Add the feature extractor to a pipeline and feed it to, say, GridSearchCV.
3. Implement a custom scorer which operates on the original data to score the model given a set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection to a competition going on right now on Kaggle

Maybe you can use stratified cross-validation (e.g. Stratified K-Fold or Stratified Shuffle Split) on the expanded samples, using the original sample index as the stratification information, in combination with a custom score function that ignores the non-original samples during model evaluation.
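A minimal sketch of the "split by original sample index" idea. Note that it swaps in GroupKFold rather than the stratified splitters named in the answer, because GroupKFold is the scikit-learn splitter that keeps all rows sharing a group label in the same fold. The synthetic data, the noise-based `expand` helper and the choice of accuracy as the score are hypothetical placeholders, not part of the original question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Hypothetical original data: 100 samples, 5 features, binary labels.
X_orig = rng.normal(size=(100, 5))
y_orig = (X_orig[:, 0] > 0).astype(int)


def expand(X, y, n_copies=10):
    """Hypothetical augmentation: each original row yields n_copies noisy rows."""
    X_new = np.repeat(X, n_copies, axis=0)
    X_new = X_new + rng.normal(scale=0.1, size=X_new.shape)
    y_new = np.repeat(y, n_copies)
    groups = np.repeat(np.arange(len(X)), n_copies)  # index of the source row
    return X_new, y_new, groups


X_gen, y_gen, groups = expand(X_orig, y_orig)

fold_scores = []
for train_idx, val_idx in GroupKFold(n_splits=5).split(X_gen, y_gen, groups):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_gen[train_idx], y_gen[train_idx])  # train on generated rows only
    # Score on the ORIGINAL rows whose generated copies were all held out.
    held_out = np.unique(groups[val_idx])
    fold_scores.append(clf.score(X_orig[held_out], y_orig[held_out]))

print(np.mean(fold_scores))
```

Because every generated copy of an original row lands in the same fold, no original row is scored by a model that saw one of its own copies during training.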

Related

Using the best predictor from Randomized search for test data

I used the 'RandomizedSearchCV' function to estimate my optimal parameters for a random forest model. Can I use the 'best_estimator_' attribute to predict on my test data?
The question I have is: during the randomized search, part of the data is used for validation, so the best RF model wouldn't have been trained on the entire data set, right? Or is it all taken care of under the hood?
As written in the documentation, cv.best_estimator_ returns the estimator that was chosen by the search, i.e. the estimator which gave the highest score.
If the parameter refit is set to True (the default), the model is refit with the best parameters on the whole dataset passed to fit, including the validation folds. Therefore you can simply use cv.best_estimator_ to predict on your test data.
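A short sketch of that workflow, assuming a synthetic dataset and an arbitrary small search space (both are illustrative choices, not taken from the question):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Illustrative data; the test split is never seen by the search.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(10, 50),
        "max_depth": randint(2, 8),
    },
    n_iter=5,
    cv=3,
    refit=True,  # the default: refit the best parameters on all of X_train
    random_state=0,
)
search.fit(X_train, y_train)

best_rf = search.best_estimator_      # already refit on the full training data
test_pred = best_rf.predict(X_test)   # search.predict(X_test) is equivalent
print(search.best_params_, best_rf.score(X_test, y_test))
```

With refit=False the best_estimator_ attribute is not available, so the refit would have to be done by hand using best_params_.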

Process training and test sets in GridSearchCV

I want to evaluate different classifiers in performing the link-prediction task by using node embedding algorithms. More specifically, I want to evaluate if node embedding can improve the accuracy of different classifiers predicting new links between nodes.
My idea is the following:
1. I create a dataset containing both positive and negative samples (real links and non-existing links).
2. I split the dataset into a Development Set (DS) and an Evaluation Set (ES).
3. I use the DS to perform the grid search cross-validation (CV) to find the best model.
4. I train the best model on the entire DS, and then I evaluate its performance on the ES.
The problem is the following: I cannot use node embedding algorithms on the entire dataset because, in this case, ES will contain information related to the original graph topology. Therefore, I need to extract node embeddings from the training and test sets generated during the Grid Search CV, but how can I do it by using the sklearn.model_selection.GridSearchCV class?
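One possible approach, sketched under assumptions: wrap the embedding step in a transformer and put it in a Pipeline, since GridSearchCV clones and refits the whole pipeline on each training fold, so the embeddings are learned from the training rows of that fold only. The `PairEmbedder` class, its SVD-based toy embedding, the Hadamard edge feature and the synthetic node pairs are all hypothetical stand-ins for a real node embedding algorithm and dataset.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class PairEmbedder(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a node embedding algorithm.

    X holds one (source, target) node-id pair per row; y is 1 for a real
    link and 0 for a non-existing link. fit() only sees the training rows.
    """

    def __init__(self, n_nodes=50, dim=8):
        self.n_nodes = n_nodes
        self.dim = dim

    def fit(self, X, y):
        X = np.asarray(X, dtype=int)
        y = np.asarray(y)
        # Build the graph from the positive TRAINING links only.
        adj = np.zeros((self.n_nodes, self.n_nodes))
        pos = X[y == 1]
        adj[pos[:, 0], pos[:, 1]] = 1.0
        adj[pos[:, 1], pos[:, 0]] = 1.0
        # Toy embedding: truncated SVD of the training adjacency matrix.
        u, s, _ = np.linalg.svd(adj)
        self.embeddings_ = u[:, : self.dim] * np.sqrt(s[: self.dim])
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=int)
        # Edge feature: element-wise product of the two node vectors.
        return self.embeddings_[X[:, 0]] * self.embeddings_[X[:, 1]]


# Synthetic development set: a link exists when both nodes share parity.
rng = np.random.default_rng(0)
pairs = rng.integers(0, 50, size=(400, 2))
labels = (pairs[:, 0] % 2 == pairs[:, 1] % 2).astype(int)

pipe = Pipeline([
    ("embed", PairEmbedder(n_nodes=50)),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"embed__dim": [4, 8], "clf__C": [0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(pairs, labels)
print(search.best_params_)
```

The same fitted pipeline can then be scored on the evaluation set, whose links never contributed to the embeddings.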

How to use cross-validation split data with RandomizedSearchCV

I'm trying to transfer my model from single run to hyper-parameter tuning using RandomizedSearchCV.
In my single run case, my data is split into train/validation/test data.
When I run RandomizedSearchCV on my train_data with default 3-fold CV, I notice that the length of my train_input is reduced to 66% of train_data (which makes sense in a 3-fold CV...).
So I'm guessing that I should merge my initial train and validation set into a larger train set and let RandomizedSearchCV split it into train and validation sets.
Would that be the right way to go?
My question is: how can I access the remaining 33% of my train_input to feed it to my validation accuracy test function (note that my score function is running on test set)?
Thanks for your help!
Yoann
I'm not sure that my code would help here since my question is rather generic.
This is the answer that I found by going through sklearn's code: RandomizedSearchCV doesn't return the split validation data in an easy way, so I should definitely merge my initial train and validation sets into a larger train set and let RandomizedSearchCV split it into train and validation sets.
The train_data is split for CV by a cross-validator into train/validation sets (in my case, Stratified K-Folds: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html).
My estimator is defined as follows:
class DNNClassifier(BaseEstimator, ClassifierMixin):
It needs a score function to be able to evaluate the CV performance on the validation set. There is a default score function defined in the ClassifierMixin class (which returns the mean accuracy and requires a predict function to be implemented in the estimator class).
In my case, I implemented a custom score function within my estimator class.
The hyperparameter search and CV fit are done by calling the fit function of RandomizedSearchCV:
RandomizedSearchCV(DNNClassifier(), param_distribs).fit(train_data)
This fit function runs the estimator's custom fit function on the train set and then the score function on the validation set.
This is done using the _fit_and_score function from the model_selection._validation module.
So I can access the automatically split validation set (33% of my train_data input) at the end of my estimator's fit function.
I'd have preferred to access it within my estimator's fit function so that I can use it to plot validation accuracy over training steps and for early stopping (I'll keep a separate validation set for that).
I guess I could reconstruct the automatically generated validation set by looking for the missing indices from my initial train_data (the train_data used in the estimator's fit function has 66% of the indices of the initial train_data).
If that is something that someone has already done I'd love to hear about it!
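A sketch of one way to get at those indices, assuming the folds are generated up front and passed to the search through its cv parameter, which accepts an iterable of (train, validation) index arrays. LogisticRegression and the synthetic data are stand-ins for the DNNClassifier and train_data above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for train_data.
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# Materialise the folds once so the same splits can be inspected and reused.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
splits = list(cv.split(X, y))

for train_idx, val_idx in splits:
    # The validation rows are exactly the indices missing from the train fold.
    missing = np.setdiff1d(np.arange(len(X)), train_idx)
    assert np.array_equal(missing, val_idx)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),  # stand-in for DNNClassifier()
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3,
    cv=splits,  # the search uses exactly these (train, validation) pairs
    random_state=0,
)
search.fit(X, y)
print([len(val_idx) for _, val_idx in splits])
```

Keeping `splits` around means the validation rows of any fold can be looked up later, for example to plot validation accuracy outside the search.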

Spark Cross Validation with Training, Testing and Validation sets

I want to do two cross-validation processes in Spark using RandomSplits, like this:
1. CV_global: by splitting data into Training Set 90% and Testing Set 10%
1.1. CV_grid: grid search on half of Training Set, i.e. 45% of data.
1.2. Fit Model: on Training Set (90%) using the best settings from CV_grid.
1.3. Test Model: on Testing Set (10%)
2. Report average metrics per 10-fold and global metrics.
The problem is I only find examples using CV and Grid search on the whole training set.
How can I get the parameters of the best performing model from CV_grid?
How to do CV without grid search but get stats per fold? e.g.
sklearn.cross_validation.cross_val_score
You have things like
crossval.setEstimatorParamMaps(paramGrid)
and then
cvModel = crossval.fit(trainingSetDF).bestModel
For single models (at least for some) there are functions like explainParams(). It's available in spark 1.6 (maybe it goes back to 1.4.2, I'm not sure).
Hope this helps
You have three questions in one. The answers for each:
1. The problem is I only find examples using CV and Grid search on the whole training set.
If you need just a portion of your training dataset, then sample it at the wanted percentage, e.g.
training = training.sample(false, .45, 78L)
2. How can I get the parameters of the best performing model from CV_grid?
crossValidatedModel.bestModel().getParamMap()
From there, get the parameter names and then their values.
3. How to do CV without grid search but get stats per fold? e.g.
duplicate of How can I access computed metrics for each fold in a CrossValidatorModel
Take a look here: Spark CrossValidatorModel access other models than the bestModel?

Is separate validation and test set needed when training SVM?

I have a set of features extracted from a training dataset, which are used to train an SVM.
The SVM parameters (e.g. C, gamma) are chosen using k-fold cross-validation, e.g. the training dataset is divided into 5 folds, with one chosen as the validation set. The folds are rotated and the average accuracy is used to choose the best parameters.
So should I then have another set (a test set) and report the results on it (as in a paper publication)? My understanding is that since the validation set was used to choose the parameters, the test set is required.
In machine learning, the Test set is something not seen until we have decided on the classifier (e.g. in competitions, the test set is unknown and we submit our final classifier based only on the training set).
The common approach is that after the cross-validation phase you may need to tune your parameters further, hence the need for a validation set to control the quality of each model.
Once you have a model that you believe can't be improved significantly on the validation set without risk of over-fitting, you use your model on the test set to report results.
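The protocol described so far (parameters chosen by 5-fold cross-validation on the training set, results reported on a test set that plays no part in the selection) can be sketched as follows. The synthetic data, the grid values and the scaling step are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative data; the test set is put aside before any tuning.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 5-fold CV over C and gamma, run on the training set only.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")), param_grid, cv=5
)
search.fit(X_train, y_train)

cv_accuracy = search.best_score_              # mean validation accuracy
test_accuracy = search.score(X_test, y_test)  # accuracy on unseen data
print(search.best_params_, cv_accuracy, test_accuracy)
```

The cross-validated accuracy was used to pick the parameters, so the test accuracy is the figure that has not been influenced by the selection.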
EDIT:
Since you are specifically asking about k-fold cross-validation: the technique implicitly sets aside part of the data for testing the resulting model, hence there is no need for an extra test step.
From the wikipedia article:
"Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data"
Wikipedia
