I am confused about using walk-forward validation. I have found several pages that use different or unclear wording for "validation" and "test".
Basically, when I apply it to time series forecasting, I divide the data into training, validation and test sets. The validation set is then a part of the training set; I understood that. Obviously, for time series, time has to be taken into account, so we have to resort to the walk-forward method.
My understanding is that the walk-forward method is only applied within the training set so that, for example, hyperparameters can be optimized. The test set plays no role in this and is only used at the end to evaluate the model. Is this correct?
Or does the walk-forward method also split the test set?
I see many examples that do not consider this walk-forward method at all.
And if the walk-forward method is used, is the part held out in each split the validation set?
Another question: if I don't split the validation set off from the training set before training, and I choose the following settings in model.fit:
model.fit(
    X_train, y_train,
    validation_split=0.1,  # the last 10% of the samples are held out for validation
    shuffle=False)         # keep the time-ordered training batches unshuffled
is this setup similar to the walk-forward method?
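For concreteness, here is a minimal sketch of the setup I have in mind, using scikit-learn's TimeSeriesSplit as the walk-forward splitter (the data and split sizes are just placeholders):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # placeholder features, ordered in time
y = rng.normal(size=200)        # placeholder target series

# hold out the last 20% as the test set; it is never touched during tuning
n_test = int(len(X) * 0.2)
X_train, X_test = X[:-n_test], X[-n_test:]
y_train, y_test = y[:-n_test], y[-n_test:]

# walk-forward (expanding-window) splits inside the training set only
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X_train):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]
    # fit a candidate model on (X_tr, y_tr) and score it on (X_val, y_val)

# after choosing the hyperparameters, refit on all of X_train
# and evaluate exactly once on (X_test, y_test)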
I used the 'RandomizedSearchCV' function to estimate my optimal parameters for a random forest model. Can I use the 'best_estimator_' attribute to predict on my test data?
The question I have is: during the randomized search, part of the data would have been used for validation. So the best-estimator RF model wouldn't have been trained on the entire data set, right? Or is it all taken care of under the hood?
As written in the documentation, cv.best_estimator_ returns the estimator that was chosen by the search, i.e. the estimator which gave the highest score.
If the parameter refit is set to True (the default value), the model will be refit using the best parameters on the whole training dataset, including the part used for validation. Therefore you can simply use cv.best_estimator_ to predict on your test data.
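A minimal sketch of this (the data and the parameter ranges are only illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=300, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [100, 300, 500], "max_depth": [None, 5, 10]},
    n_iter=5,
    cv=5,
    refit=True,  # after the search, the best parameters are refit on all of X_train
)
search.fit(X_train, y_train)

y_pred = search.best_estimator_.predict(X_test)  # equivalent to search.predict(X_test)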
I'm training a token classification (AKA named entity recognition) model with the HuggingFace Transformers library, with a customized data loader.
Like most NER datasets (I'd imagine?) there's a pretty significant class imbalance: A large majority of tokens are other - i.e. not an entity - and of course there's a little variation between the different entity classes themselves.
As we might expect, my "accuracy" metrics are getting distorted quite a lot by this: It's no great achievement to get 80% token classification accuracy if 90% of your tokens are other... A trivial model could have done better!
I can calculate some additional and more insightful evaluation metrics - but it got me wondering... Can/should we somehow incorporate these weights into the training loss? How would this be done using a typical *ForTokenClassification model, e.g. BertForTokenClassification?
This is actually a really interesting question, since it seems there is no intention (yet) to let you modify the losses in the models yourself. Specifically for BertForTokenClassification, I found this code segment:
loss_fct = CrossEntropyLoss()
# ...
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
To actually change the loss computation and add other parameters, e.g., the weights you mention, you can go about either one of two ways:
You can modify a copy of transformers locally, and install the library from there, which makes this only a small change in the code, but potentially quite a hassle to change parts during different experiments, or
You return your logits (which is the case by default) and calculate your own loss outside of the actual forward pass of the huggingface model. In this case, you need to be aware of any potential backpropagation from a loss calculated within the forward call, but this should be within your power to change (for example, by not passing labels to the model, so it never computes its own loss).
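Picking up the second option, here is a minimal sketch of computing a weighted loss from the returned logits. The number of labels, the class weights and the dummy batch are placeholders, and I'm assuming the usual HuggingFace convention of marking ignored (padding / sub-word) positions with label -100:

import torch
from torch.nn import CrossEntropyLoss
from transformers import BertForTokenClassification

num_labels = 5  # placeholder: entity classes incl. "other"
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=num_labels)

# placeholder per-class weights, e.g. down-weighting the majority "other" class
class_weights = torch.tensor([0.1, 1.0, 1.0, 1.0, 1.0])
# ignore_index=-100 skips padding / sub-word positions in the loss
loss_fct = CrossEntropyLoss(weight=class_weights, ignore_index=-100)

# dummy batch, just to make the sketch self-contained
input_ids = torch.randint(0, 1000, (2, 16))
attention_mask = torch.ones_like(input_ids)
labels = torch.randint(0, num_labels, (2, 16))

# do NOT pass labels to the model, so it does not compute its own (unweighted) loss
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
logits = outputs.logits  # shape: (batch_size, seq_len, num_labels)

loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
loss.backward()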
I've been working with the MLPClassifier for a while and I think I have had a wrong interpretation of what it is doing the whole time. I think I've got it right now, but I am not sure about that. So I will summarize my understanding, and it would be great if you could add your thoughts on the right understanding.
So with the MLPClassifier we are building a neural network based on a training dataset. Setting early_stopping = True it is possible to use a validation dataset within the training process in order to check whether the network is working on a new set as well. If early_stopping = False, no validation within the process is done. After one has finished building, we can use the fitted model in order to predict on a third dataset if we wish to.
What I was thinking before is that during the whole training process a validation dataset is being taken aside anyway, with validation after every epoch.
I'm not sure if my question is understandable, but it would be great if you could help me to clear my thoughts.
The sklearn.neural_network.MLPClassifier uses (a variant of) Stochastic Gradient Descent (SGD) by default. Your question could be framed more generally as how SGD is used to optimize the parameter values in a supervised learning context. There is nothing specific to Multi-layer Perceptrons (MLP) here.
So with the MLPClassifier we are building a neural network based on a training dataset. Setting early_stopping = True it is possible to use a validation dataset within the training process
Correct, although it should be noted that this validation set is taken away from the original training set.
in order to check whether the network is working on a new set as well.
Not quite. The point of early stopping is to track the validation score during training and stop training as soon as the validation score stops improving significantly.
If early_stopping = False, no validation within the process is done. After one has finished building, we can use the fitted model in order to predict on a third dataset if we wish to.
Correct.
What I was thinking before is that during the whole training process a validation dataset is being taken aside anyway, with validation after every epoch.
As you probably know by now, this is not so. The division of the learning process into epochs is somewhat arbitrary and has nothing to do with validation.
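For reference, a minimal sketch of the relevant settings (the data and network size are placeholders):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(
    hidden_layer_sizes=(50,),
    early_stopping=True,      # carve a validation set out of the training data...
    validation_fraction=0.1,  # ...here 10% of X_train
    n_iter_no_change=10,      # stop once the validation score has not improved for 10 iterations
    random_state=0,
)
clf.fit(X_train, y_train)     # with early_stopping=False, no validation split would be made

score = clf.score(X_test, y_test)  # the separate test set is only used for the final evaluation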
I tuned a RandomForest with GroupKFold (to prevent data leakage because some rows came from the same group).
I get a best fit model, but when I go to make a prediction on the test data it says that it needs the group feature.
Does that make sense? It's odd that the group feature is coming up as one of the most important features as well.
I'm just wondering if there is something I could be doing wrong.
Thanks
A search on the scikit-learn Github repo does not reveal a single instance of the string "group feature" or "group_feature" or anything similar, so I will go ahead and assume you have in your data set a feature called "group" that the prediction model requires as input in order to produce an output.
Remember that a prediction model is basically a function that takes an input (the "predictor" variable) and returns an output (the "predicted" variable). If a variable called "group" was defined as input for your prediction model, then it makes sense that scikit-learn would request it.
Does the group appear as a column in the training set? If so, remove it and re-train. It looks like you are just using it to generate splits. If it isn't part of the input data you need to predict from, it shouldn't be in the training set.
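A minimal sketch of what that looks like (the column names and data are placeholders): the group labels are passed to the splitter via the groups argument instead of being left in X:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
df["group"] = rng.integers(0, 10, size=100)   # e.g. a patient or session id
df["target"] = rng.integers(0, 2, size=100)

groups = df["group"]                       # used only to build the folds
X = df.drop(columns=["group", "target"])   # the group column is NOT a model input
y = df["target"]

scores = cross_val_score(
    RandomForestClassifier(random_state=0),
    X, y,
    groups=groups,
    cv=GroupKFold(n_splits=5),
)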
Suppose a set of features is extracted from a training dataset and used to train an SVM.
The SVM parameters (e.g. C, gamma) are chosen using k-fold cross-validation, e.g. the training dataset is divided into 5 folds, with one chosen as the validation set. The folds are rotated and the average accuracy is used to choose the best parameters.
So should I then have another set (a test set) and report the results on it (as in a paper publication)? My understanding is that since the validation set was used to choose the parameters, the test set is required.
In machine learning, the Test set is something not seen until we have decided on the classifier (e.g. in competitions, the test set is unknown and we submit our final classifier based only on the training set).
The common approach is that after the cross-validation phase, you would need to tune your parameters further, hence the need for a validation set to control the quality of each model.
Once you have a model that you believe can't be improved significantly on the validation set without risk of over-fitting, you then use your model on the test set to report results.
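A minimal sketch of that overall workflow, i.e. cross-validation on the training set for parameter selection, then a single evaluation on the held-out test set (the data, the parameter grid and the split are only illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training set chooses C and gamma
search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]},
    cv=5,
)
search.fit(X_train, y_train)

# the untouched test set is used exactly once, to report the final accuracy
test_accuracy = search.score(X_test, y_test)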
EDIT:
Since you are specifically asking about k-fold cross-validation: the technique implicitly sets aside a subsample for testing the resulting model in each rotation, hence there is no need for an extra test step.
From the wikipedia article:
"Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data"