I am following the code from "Hands-On Machine Learning with Scikit-Learn and TensorFlow, 2nd edition" (ipynb link). In the section on selecting the training and test data sets, the author brings up the importance of writing the splitting function so that the test set stays consistent over multiple runs, even if the data set is refreshed. The code is written so that an updated data set will still be split at the right percentage (test ratio) into test and training sets, but the new test set won't contain any instance that was previously in the training set. It does this by hashing the index value (identifier/id_) and returning true if that hash falls between 0 and (test ratio) of the range of possible hash values.
import numpy as np
from zlib import crc32

def test_set_check(identifier, test_ratio):
    # Hash the identifier and put it in the test set if the hash falls in the
    # lowest test_ratio fraction of the 32-bit hash range.
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
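For context, a minimal usage sketch (the DataFrame and column names here are made up; the book uses reset_index() so the row index can serve as a stable identifier):

import pandas as pd

# Hypothetical data; in the book this is the housing DataFrame.
data = pd.DataFrame({"value": range(1000)})
data_with_id = data.reset_index()  # adds an "index" column to use as the id

train_set, test_set = split_train_test_by_id(data_with_id, 0.2, "index")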
This part makes sense, but what I don't understand is how to implement the same thing using the function train_test_split from scikit-learn. Is there something specific to do so that, if the whole data set is updated, the test set never includes an instance that was previously in the training set? Is this already handled if we use the random_state argument and make sure that the updated data set only adds rows to the existing data set and never deletes rows? Is that a realistic thing to require?
Is this a problem to worry about with cross validation as well?
Thanks for your help.
If you're worried that rows from the training set will end up in the test set and eventually affect the performance of the classifier when using train_test_split, I would suggest not to worry; that won't happen. In order to keep the training and test sets consistent over multiple runs, random_state can be used.
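A minimal sketch of what this answer describes (the data here is made up). One caveat I'm fairly sure of: a fixed random_state reproduces the same split only as long as the data itself is unchanged; it does not give the hash-based guarantee from the book once new rows are appended.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data set.
data = pd.DataFrame({"feature": range(100), "target": [i % 2 for i in range(100)]})

# random_state fixes the shuffling, so re-running this on the *same* data
# always produces the same train/test split.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)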
I'd like to use GridSearchCV, but with the condition that the lowest index of the data in the validation set be greater than the largest in the training set. The reason being that the data is in time, and future data gives unfair insight that would inflate the score. There's some discussion on this:
If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.
but it's not clear to me whether any of the splitting methods listed can accomplish what I'm looking for. It seems I can define an iterable of indices and pass that into cv, but in that case it's not clear how many splits I should define (does it always use all of them? do different tests get different indices?).
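A possible sketch (the estimator and data are made up): TimeSeriesSplit produces exactly the kind of splits described, where every validation index comes after every training index, and it can be passed to GridSearchCV as cv; an explicit iterable of (train_indices, test_indices) pairs works the same way, and GridSearchCV evaluates every split you supply.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Hypothetical time-ordered data.
X = np.random.rand(200, 3)
y = np.random.rand(200)

# Each split's validation indices come strictly after its training indices.
tscv = TimeSeriesSplit(n_splits=5)
param_grid = {"n_estimators": [50, 100]}
search = GridSearchCV(RandomForestRegressor(), param_grid, cv=tscv)
search.fit(X, y)

# Same idea with an explicit iterable of (train, test) index arrays;
# every split in the list is used, and the scores are averaged.
custom_cv = [(np.arange(0, 150), np.arange(150, 200))]
search2 = GridSearchCV(RandomForestRegressor(), param_grid, cv=custom_cv)
search2.fit(X, y)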
I tuned a RandomForest with GroupKFold (to prevent data leakage because some rows came from the same group).
I get a best fit model, but when I go to make a prediction on the test data it says that it needs the group feature.
Does that make sense? It's odd that the group feature is coming up as one of the most important features as well.
I'm just wondering if there is something I could be doing wrong.
Thanks
A search on the scikit-learn GitHub repo does not reveal a single instance of the string "group feature" or "group_feature" or anything similar, so I will go ahead and assume you have in your data set a feature called "group" that the prediction model requires as input in order to produce an output.
Remember that a prediction model is basically a function that takes an input (the "predictor" variable) and returns an output (the "predicted" variable). If a variable called "group" was defined as input for your prediction model, then it makes sense that scikit-learn would request it.
Does the group appear as a column on the training set? If so, remove it and re-train. It looks like you are just using it to generate splits. If it isn't a part of the input data you need to predict, it shouldn't be in the training set.
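A rough sketch of what this suggests (the column names and estimator are hypothetical): keep the group labels out of X and pass them only through the groups argument of the splitter, so the fitted model never expects a group column at prediction time.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Hypothetical DataFrame with a "group" column used only for splitting.
df = pd.DataFrame({
    "feat1": range(12),
    "feat2": [x * 0.5 for x in range(12)],
    "group": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "target": [0, 1] * 6,
})

X = df.drop(columns=["group", "target"])   # group is NOT a model feature
y = df["target"]
groups = df["group"]                       # used only to build the folds

search = GridSearchCV(RandomForestClassifier(), {"n_estimators": [10, 50]},
                      cv=GroupKFold(n_splits=3))
search.fit(X, y, groups=groups)

# At prediction time only feat1/feat2 are needed, no group column.
preds = search.predict(X)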
The data set I'm working with consists of train and test sets. To fine-tune the deep learning model, 10% of the training set is used as a validation set. After finding the optimal hyperparameter values, two possible options are:
a) Evaluate the model (i.e., the model trained on 90% of the train set) with the test set
b) Evaluate the model (i.e., the model retrained on the complete train set) with the test set
Which of the above options is valid, and why?
Both options are possible.
In the first case the hyperparameters (HPPs) are optimal; in the second case they are usually close to the optimal hyperparameters (but not exactly optimal), yet you have a more representative dataset.
What is suggested in general is to do cross-validation (https://scikit-learn.org/stable/modules/cross_validation.html): choose different train/test splits to have a more representative case, and choose the best HPPs based on the average of the values over the folds.
Otherwise, what you risk is that your model would be very good for this specific case, as can happen with Kaggle datasets, but maybe not representative of real use cases once in production.
To sum up:
1. If you just want the best model for this specific set, option 1 is probably the safest (option 2 can be done too, but you may get worse results)
2. If you are in a "real case study", you would do better to use cross-validation to get a more robust set of HPPs
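A rough sketch of the workflow suggested above, with made-up data and an arbitrary estimator: choose the HPP by the average score over the folds of the training set, retrain on the complete training set with that HPP (option b), and only then evaluate once on the test set.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical data split into train and test sets.
X = np.random.rand(500, 5)
y = np.random.randint(0, 2, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Choose the hyperparameter by the average score across CV folds of the train set.
best_C, best_score = None, -np.inf
for C in [0.01, 0.1, 1.0, 10.0]:
    score = cross_val_score(LogisticRegression(C=C, max_iter=1000),
                            X_train, y_train, cv=5).mean()
    if score > best_score:
        best_C, best_score = C, score

# Retrain on the complete training set with the chosen HPP (option b),
# then evaluate exactly once on the untouched test set.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print(final_model.score(X_test, y_test))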
I want to do two cross-validation processes in Spark using randomSplit, like:
CV_global: by splitting data into Training Set 90% and Testing Set 10%
1.1. CV_grid: grid search on half of Training Set, i.e. 45% of data.
1.2. Fit Model: on Training set (90%) using the best settings from CV_grid.
1.3 Test Model: on Testing set (10%)
Report Average metrics per 10-fold and global metrics.
The problem is I only find examples using CV and Grid search on the whole training set.
How can I get the parameters of the best performing model from CV_grid?
How to do CV without grid search but get stats per fold? e.g.
sklearn.cross_validation.cross_val_score
You have things like
crossval.setEstimatorParamMaps(paramGrid)
and then
cvModel = crossval.fit(trainingSetDF).bestModel
For single models (at least for some) there are functions like explainParams(). It's available in Spark 1.6 (maybe it goes back to 1.4.2, I'm not sure).
Hope this helps
You have three questions in one. The answers to each:
1. The problem is I only find examples using CV and Grid search on the whole training set.
If you need just a portion of your training dataset, then sample it at the desired percentage, e.g.
training = training.sample(false, .45, 78L)
2. How can I get the parameters of the best performing model from CV_grid?
crossValidatedModel.bestModel().getParamMap()
From there, get the parameter names and then their values (see the sketch after this answer).
3. How to do CV without grid search but get stats per fold? e.g.
duplicate of How can I access computed metrics for each fold in a CrossValidatorModel
Take a look here: Spark CrossValidatorModel access other models than the bestModel?
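For what it's worth, a hedged PySpark sketch of points 2 and 3 (the estimator, evaluator, and the `training` DataFrame are assumptions): bestModel holds the winning model, extractParamMap() exposes its parameters, and avgMetrics gives the cross-validated average metric for each parameter combination; per-fold metrics are what the linked questions discuss.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Assumes an existing SparkSession and a DataFrame `training`
# with "features" and "label" columns.
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=10)

cvModel = cv.fit(training)

# 2. Parameters of the best performing model.
print(cvModel.bestModel.extractParamMap())

# 3. Average metric per parameter combination (averaged over the folds).
print(cvModel.avgMetrics)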
Given a set of features extracted from a training dataset, which are used to train an SVM.
The SVM parameters (e.g. C, gamma) are chosen using k-fold cross-validation, e.g. the training dataset is divided into 5 folds, with one chosen as the validation set. The folds are rotated and the average accuracy is used to choose the best parameters.
So should I then have another set (a test set) and report the results on this (as in a paper publication)? My understanding is that since the validation set was used to choose the parameters, the test set is required.
In machine learning, the Test set is something not seen until we have decided on the classifier (e.g. in competitions, the test set is unknown and we submit our final classifier based only on the training set).
The common approach is that after the cross-validation phase, you would need to tune your parameters further, hence the need for a validation set to control the quality of each model.
Once you have a model that you believe can't be improved significantly over the validation set without risk of over-fitting, then you use your model over the test set to report results.
EDIT:
Since you are specifically asking about k-fold cross-validation, the technique implicitly sets aside a subsample for testing the resulting model, hence there is no need for an extra test step.
From the wikipedia article:
"Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data"
Wikipedia
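To make the workflow in the question above concrete, a sketch only (the data is synthetic): GridSearchCV does the 5-fold rotation over C and gamma on the training set, and the separately held-out test set is touched exactly once for the reported result.

import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the extracted features.
X = np.random.rand(300, 10)
y = np.random.randint(0, 2, 300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold CV on the training set rotates the validation fold and averages accuracy.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# The test set is only used once, for the final reported number.
print(search.best_params_, search.score(X_test, y_test))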