I am currently working my way through a book on machine learning.
While building a NaiveBayesClassifier, the author is very much in favour of the cross-validation method.
He proposes splitting the data into ten buckets (files) and training on nine of them each time, withholding a different bucket.
Up to now, the only approach I am familiar with is to split the data into a training set and a test set in a 50%/50% ratio and simply train the classifier all at once.
Could someone explain the possible advantages of using cross-validation?
Cross-validation is a way to address the trade-off between bias and variance.
When you fit a model on a training set, you can drive the training error down by adding more terms, higher-order polynomials, and so on, but doing so increases the model's variance.
Your true objective, though, is to predict outcomes for points the model has never seen, and that is what holding out a test set simulates.
You build your model on the training set, then try it out on the test set. You will find there is a combination of bias and variance that minimizes the overall error and gives your best results; the simplest model that achieves this should be your choice.
I'd recommend "An Introduction to Statistical Learning" or "The Elements of Statistical Learning" by Hastie and Tibshirani for more details.
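To make the trade-off concrete, here is a rough sketch (assuming scikit-learn and a synthetic dataset; none of this comes from the book) that fits polynomials of increasing degree and compares training and test error:

```python
# Sketch of the bias/variance trade-off: training error keeps falling with
# model complexity, while test error eventually rises again.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy synthetic target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```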
A general principle in machine learning is that the more data you have for training, the better your results will be. This is important to state before answering the question.
Cross-validation helps us avoid overfitting the model and also helps to improve the generalization accuracy, i.e. the accuracy of the model on unseen future points. When you divide your dataset only into Dtrain and Dtest, there is a problem: if any part of your model (a hyperparameter, for instance) is chosen using the test data, you can no longer claim that the accuracy on future unseen points will match the accuracy you got on the test data. This can be illustrated with k-NN, where the nearest neighbours are found using the training data; if the value of k is tuned on the test data, the test data is no longer truly unseen.
But if you use cross-validation, then k can be determined on the CV data, and your test data can still be treated as genuinely unseen points.
Now suppose you divide your dataset into three parts: Dtrain (60%), Dcv (20%) and Dtest (20%). You then have only 60% of the data to train with. If you want to use the full 80% for training, you can do this with m-fold cross-validation. In m-fold CV you divide the data into two parts, Dtrain and Dtest (say 80% and 20%).
Say m is 4: you randomly divide the training data into four equal parts (d1, d2, d3, d4). First train the model on d1, d2, d3 with d4 as the validation fold and record the accuracy; in the next round take d2, d3, d4 as the training folds and d1 as the validation fold, and continue until each fold has served once as the validation set. Averaging over the folds, you end up using the entire 80% for training, and Dtest can still be treated as a future unseen dataset.
The advantages are that you make fuller use of your training data, reduce overfitting, and get a more reliable estimate of the generalization accuracy.
On the downside, the time complexity is high.
In your case the value of m is 10.
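As a rough sketch (assuming scikit-learn; the dataset and the choice of GaussianNB are placeholders for whatever you actually use), 10-fold cross-validation on the 80% training split could look like this:

```python
# 10-fold cross-validation on Dtrain, with Dtest held back as "unseen" data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% as Dtest; the remaining 80% is Dtrain, used for cross-validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GaussianNB()
scores = cross_val_score(model, X_train, y_train, cv=10)  # 10 folds, as in the book
print("CV accuracy per fold:", np.round(scores, 3))
print("Mean CV accuracy:", scores.mean())

# Only after settling on the final model do we touch Dtest.
model.fit(X_train, y_train)
print("Held-out test accuracy:", model.score(X_test, y_test))
```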
Hope this helps.
The idea is to have the maximum number of points available to train the model in order to achieve accurate results. Every data point chosen for the training set is excluded from the test set. So we use the idea of k and k-1: we first divide the dataset into k equal-sized bins, take one bin as the test set, and let the remaining k-1 bins form the training set. We repeat the process until every bin has been used once as the test set, with the remaining k-1 bins used for training. This way, no data point is missed out for training purposes.
Related
I am classifying data with categorical variables. It is data where people have provided information.
My training dataset is of varying quality. I have greater confidence in some of the data, i.e. I am more confident that people have provided correct information, whereas for other parts of the data I am not so sure.
How can I pass this information into a classification algorithm such as Naive Bayes or K nearest neighbour?
Or should I instead look to another algorithm?
I think what you want to do is to provide individual weights (for the importance/confidence) for each data point you have.
For instance, if you are very certain that one data point is of higher quality and should carry a higher weight than others that you are less confident in, you can specify that when fitting your classifier.
Sklearn, for instance, provides the Gaussian Naive Bayes classifier (GaussianNB) for that.
Here, you can pass sample_weight when calling the fit() method.
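A minimal sketch of what that could look like (the feature matrix and the confidence values below are made up purely for illustration):

```python
# Weighting training points by confidence with GaussianNB.
# The data and the confidence values are invented for illustration only.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.0], [1.2, 1.9], [3.0, 4.1], [2.9, 3.8]])
y = np.array([0, 0, 1, 1])

# Higher value = more confidence that the provided information is correct.
confidence = np.array([1.0, 0.3, 1.0, 0.5])

clf = GaussianNB()
clf.fit(X, y, sample_weight=confidence)  # low-confidence rows influence the fit less

print(clf.predict([[1.1, 2.0]]))
```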
I have limited knowledge about sample_weights in the sklearn library, but from what I gather, it's generally used to help balance imbalanced datasets during training. What I'm wondering is, if I already have a perfectly balanced binary classification dataset (i.e. equal amounts of 1's and 0's in the label/Y/class column), could one add a sample weight to the 0's in order to put more importance on predicting the 1's correctly?
For example, let's say I really want my model to predict 1's well, and it's ok to predict 0's even though they turn out to be 1's. Would setting a sample_weight of 2 for 0's, and 1 for the 1's be the correct thing to do here in order to put more importance on correctly predicting the 1's? Or does that matter? And then I guess during training, is the f1 scoring function generally accepted as the best metric to use?
Thanks for the input!
ANSWER
After a couple of rounds of testing and more research, I've discovered that yes, it does make sense to add more weight to the 0's with a balanced binary classification dataset, if your goal is to decrease the chance of over-predicting the 1's. I ran two separate training sessions, once using a weight of 2 for the 0's and 1 for the 1's, and then vice versa, and found that my model predicted fewer 1's when the weight was applied to the 0's, which was my ultimate goal.
In case that helps anyone.
Also, I'm using SKLearn's Balanced Accuracy scoring function for those tests, which takes an average of each separate class's accuracy.
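For anyone wanting to reproduce this kind of comparison, here is a rough sketch (the synthetic dataset and the choice of LogisticRegression are placeholders, not the setup described above):

```python
# Compare per-class sample weights on a balanced binary dataset and score
# with balanced accuracy. Data and model are stand-ins for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

X, y = make_classification(n_samples=2000, weights=[0.5, 0.5], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for w0, w1 in [(1.0, 1.0), (2.0, 1.0)]:          # weight for class 0, class 1
    sw = np.where(y_tr == 0, w0, w1)             # per-sample weights
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr, sample_weight=sw)
    pred = clf.predict(X_te)
    print(f"w0={w0}, w1={w1}: predicted 1's = {pred.sum()}, "
          f"balanced accuracy = {balanced_accuracy_score(y_te, pred):.3f}")
```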
I am currently fitting a neural network to predict a continuous target from 1 to 10. However, the samples are not evenly distributed over the entire data set: samples with target ranging from 1-3 are quite underrepresented (only account for around 5% of the data). However, they are of big interest, since the low range of the target is kind of the critical range.
Is there any way to know how my model predicts these low range samples in particular? I know that when doing multiclass classification I can examine the recall to get a taste of how well the model performs on a certain class. For classification use cases I can also set the class weight parameter in Keras to account for class imbalances, but this is obviously not possible for regression.
So far I have used typical metrics like MAE, MSE and RMSE and get satisfying results. I would, however, like to know how the model performs on the "critical" samples.
From my point of view, I would compute the test measurements (MAE, MSE, RMSE) for the whole test set, i.e. over the whole range of target values (1-10). Then I would compute them separately for the specific range that you consider critical (say 1-3) and compare how much the two populations diverge. You can even run statistical tests on the significance of the difference between the two populations (rank tests such as the Wilcoxon/Mann-Whitney family, etc.).
Maybe this link could be useful for your comparisons. Since you are doing regression, you can compare MSE and RMSE directly as well.
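A small sketch of that comparison (the true values and predictions below are synthetic stand-ins; a Mann-Whitney U test is used since the two error populations are independent):

```python
# Compare prediction errors on the critical range (1-3) against the rest.
# y_true and y_pred here are synthetic stand-ins for your test-set arrays.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.RandomState(0)
y_true = rng.uniform(1, 10, size=500)
y_pred = y_true + rng.normal(scale=0.8, size=500)

abs_err = np.abs(y_true - y_pred)
critical = y_true <= 3                            # the "critical" low range

print("MAE critical:", abs_err[critical].mean())
print("MAE rest    :", abs_err[~critical].mean())

# Rank test for whether the two error populations differ significantly.
stat, p = mannwhitneyu(abs_err[critical], abs_err[~critical], alternative="two-sided")
print("Mann-Whitney U p-value:", p)
```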
What you need to do is find identifiers for these critical samples; row indices are often used for this. Once you have predicted all of your samples, use those stored indices to pick the critical samples out of your predictions and run whatever metric you like over just those filtered samples. I hope this answers your question.
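A minimal sketch of that index-based filtering (the arrays below are synthetic stand-ins; it assumes the test targets and predictions are aligned):

```python
# Keep the row indices of the critical samples and score only those rows.
# y_test and predictions are synthetic stand-ins for aligned 1-D arrays.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.RandomState(1)
y_test = rng.uniform(1, 10, size=300)
predictions = y_test + rng.normal(scale=0.7, size=300)

critical_idx = np.where(y_test <= 3)[0]             # identifiers of critical rows

mae = mean_absolute_error(y_test[critical_idx], predictions[critical_idx])
rmse = np.sqrt(mean_squared_error(y_test[critical_idx], predictions[critical_idx]))
print(f"critical samples: {len(critical_idx)}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```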
I am working on a time-series prediction problem using GradientBoostingRegressor, and I think I'm seeing significant overfitting, as evidenced by a significantly better RMSE for training than for prediction. In order to examine this, I'm trying to use sklearn.model_selection.cross_validate, but I'm having problems understanding the result.
First: I was calculating RMSE by fitting to all my training data, then "predicting" the training data outputs using the fitted model and comparing those with the training outputs (the same ones I used for fitting). The RMSE that I observe is of the same order of magnitude as the predicted values and, more importantly, it's in the same ballpark as the RMSE I get when I submit my predicted results to Kaggle (although the training RMSE is the lower of the two, reflecting overfitting).
Second, I use the same training data, but apply sklearn.model_selection.cross_validate as follows:
cross_validate(predictor, features, targets, cv=5, scoring="neg_mean_squared_error")
I figure neg_mean_squared_error should be (the negative of) the square of my RMSE. Accounting for that, I still find that the error reported by cross_validate is one or two orders of magnitude smaller than the RMSE I was calculating as described above.
In addition, when I modify my GradientBoostingRegressor max_depth from 3 to 2, which I would expect reduces overfitting and thus should improve the CV error, I find that the opposite is the case.
I'm keenly interested to use Cross Validation so I don't have to validate my hyperparameter choices by using up Kaggle submissions, but given what I've observed, I'm not clear that the results will be understandable or useful.
Can someone explain how I should be using Cross Validation to get meaningful results?
I think there is a conceptual problem here.
If you want to estimate the prediction error, you should not use the training data. As the name says, that data is used only for training; to evaluate accuracy scores you have to use data the model has never seen.
As for cross-validation, it is an approach for evaluating the model on different training/testing splits. The process is as follows: you divide your data into n groups and iterate, changing which group you pick as the test set. With n groups you do n iterations, and each time the training and testing sets are different.
Basically, what you should do is something like this:
Train the model using months 0 to 30 (for example).
Evaluate the predictions made with months 31 to 35 as input.
If the input has to be the same length, split the features in half (that would be about 17 months each).
I hope I understood correctly; otherwise, please comment.
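Relating this back to the cross_validate call in the question, here is a rough sketch (the data and regressor settings are placeholders) of converting the neg_mean_squared_error scores to RMSE and using a time-ordered split for time-series data:

```python
# Turn cross_validate's neg_mean_squared_error back into RMSE, and use a
# time-ordered split instead of shuffled folds. 'features' and 'targets'
# are placeholders for the question's actual data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate, TimeSeriesSplit

features, targets = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

predictor = GradientBoostingRegressor(max_depth=3, random_state=0)

cv = TimeSeriesSplit(n_splits=5)                 # respects temporal ordering
result = cross_validate(predictor, features, targets, cv=cv,
                        scoring="neg_mean_squared_error")

# Scores are negated MSE, so flip the sign and take the square root for RMSE.
rmse_per_fold = np.sqrt(-result["test_score"])
print("RMSE per fold:", np.round(rmse_per_fold, 2))
print("Mean CV RMSE :", rmse_per_fold.mean())
```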
I am using sklearn's random forests module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features tend to make the model less accurate.
I suspect that each split is only made along one dimension, which means that features that are actually more important get less attention when building the trees. Could this be the reason?
Yes, the additional features you have added might not have good predictive power, and since a random forest considers a random subset of features when building individual trees, the original 50 features may often be missed. To test this hypothesis, you can plot the variable importances using sklearn.
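A minimal sketch of that check (the synthetic data below stands in for your 150-feature matrix):

```python
# Inspect which features the forest actually relies on. The synthetic
# dataset stands in for the real 150-dimensional data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=150, n_informative=50, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# Sort features by importance; if the added columns carry no signal,
# they should cluster near the bottom of this ranking.
order = np.argsort(forest.feature_importances_)[::-1]
for i in order[:10]:
    print(f"feature {i:3d}: importance {forest.feature_importances_[i]:.4f}")
```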
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
[Figure: 2D decision boundaries learned for a binary classification task - https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c]
There are plenty of illustrations of overfitting; this 2D plot, for instance, shows different decision functions that could have been learned for a binary classification task. Because the function on the right has too many parameters, it picks up spurious patterns in the data that don't generalize properly.