Sklearn overfitting - scikit-learn

I have a data set containing 1000 points, each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing. I am training it using the sklearn support vector regressor. I get 100% accuracy on the training set, but the results on the test set are not good. I think it may be because of overfitting. Can you suggest something to solve the problem?

You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with different settings. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be instantiated with several input parameters, each of which can be set to a number of different values. For simplicity, let's assume you only want to tweak two parameters, 'kernel' and 'C', while keeping a third parameter, 'degree', set to 4 (note that 'degree' is only used by the 'poly' kernel and is ignored by the others). You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
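from sklearn.svm import SVR

# Fit one model per (kernel, C) combination and score it on the test data
for kernel in ('rbf', 'linear'):
    for c in (0.1, 1, 10):
        svr = SVR(kernel=kernel, C=c, degree=4)
        svr.fit(train_features, train_target)
        score = svr.score(test_features, test_target)  # R^2 on the test data
        print(kernel, c, score)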
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
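from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
clf = GridSearchCV(SVR(degree=4), parameters)  # SVR, since this is a regression problem
clf.fit(train_features, train_target)
print(clf.best_score_)
print(clf.best_params_)
model = clf.best_estimator_  # This is your model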
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
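To make the PolynomialFeatures suggestion concrete, here is a minimal sketch. The synthetic data, the choice of LinearRegression, and degree=2 are all assumptions made for illustration; they are not from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption): 2 inputs, nonlinear target.
rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(1000, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Plain linear model on the raw inputs.
linear = LinearRegression().fit(X_train, y_train)

# Same model, but fed squared and interaction terms as extra features.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(X_train, y_train)

linear_r2 = linear.score(X_test, y_test)
poly_r2 = poly.score(X_test, y_test)
print("linear R^2:", linear_r2)
print("polynomial R^2:", poly_r2)
```

If the polynomial version scores clearly better on held-out data, that is a hint the relationship is nonlinear. Whether it helps on your data is something only the comparison can tell you.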

Try fitting your model using sklearn's K-Fold cross-validation on the training data. This gives you a fair split of the data and a better model, though at a cost in computation time, which should not really matter for a small dataset where the priority is accuracy.
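A minimal sketch of what K-Fold cross-validation could look like here. The synthetic data and the SVR settings are assumptions for illustration, not the asker's actual setup.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the 1000-point, 2-input data set (an assumption).
rng = np.random.RandomState(0)
features = rng.uniform(-3, 3, size=(1000, 2))
target = (
    np.sin(features[:, 0])
    + 0.5 * features[:, 1]
    + rng.normal(scale=0.1, size=1000)
)

# 5 folds: every point is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# For a regressor, the default score is R^2; one value per fold.
scores = cross_val_score(SVR(kernel="rbf", C=1.0), features, target, cv=cv)

print("R^2 per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

A large spread between the folds would suggest the score depends heavily on which points end up in the validation part.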

A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by @shahins.
In particular, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it. A smaller C corresponds to stronger regularization of the estimate.
If it's taking too long, you can also try RandomizedSearchCV.
As a side note on @shahins's answer (I am not allowed to add comments): the two implementations are not equivalent. GridSearchCV is better, since it performs cross-validation within the training set to tune the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data
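The hints above can be combined into one sketch: scale inside a Pipeline, tune with GridSearchCV on the training set only, and touch the test set once at the end. The synthetic data and the parameter grid are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption); the second input is on a
# much larger scale than the first, which is where scaling matters.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(1000, 2)) * np.array([1.0, 100.0])
y = np.sin(X[:, 0]) + 0.005 * X[:, 1] + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is fitted inside each CV fold, so no validation data leaks in.
pipe = make_pipeline(StandardScaler(), SVR())

# make_pipeline names each step after its lowercased class name ("svr").
param_grid = {"svr__kernel": ["linear", "rbf"], "svr__C": [0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)  # hyperparameters tuned on training data only

print("best params:", search.best_params_)
print("cross-validated R^2:", search.best_score_)

# The test set is used once, for the final estimate.
test_score = search.score(X_test, y_test)
print("test R^2:", test_score)
```

If the grid search is too slow, RandomizedSearchCV takes a similar set of arguments (with param_distributions and n_iter instead of param_grid).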

Related

Model underfitting

I have trained a model and it took me quite a while to find the correct hyperparameters.
The model has now been trained for 15h and it seems to do its job quite well.
When I looked at the training and validation loss, though, the training loss was somewhat higher than the validation loss. (red curve: training, green: validation)
I use dropout to regularize my model and, as far as I have understood, dropout is only applied during training, which might be the reason.
Now I am wondering if I have trained a valid model?
It doesn't seem like the model is heavily underfitted?
Thanks in advance for any advice,
cheers,
M
First, check whether you have a good data set: if it is a classification problem, get an equal number of images for all classes, and get them from the same source rather than from different sources. Regularization and dropout are remedies for overfitting/high variance, so you don't need to worry about them here.
Then, I think your model is doing well. At the start of training the two errors differ, but as the number of epochs increases they both settle onto a steady path, which is good. The reason for the gap may be what I mentioned above, or you could try shuffling the data and then using train_test_split to get a better distribution of the training and validation sets.
A plot of learning curves shows a good fit if:
The plot of training loss decreases to a point of stability.
The plot of validation loss decreases to a point of stability and has a small gap with the training loss.
In your case these conditions are satisfied.
Still, if you want to deal with high bias/underfitting, here are a few methods:
Train bigger models
Train longer. Use better optimization techniques
Try different neural network architectures and hyperparameters
You can also use cross-validation or GridSearchCV to find a better optimizer or better hyperparameters, but it may take really long, because the model has to be retrained for each parameter combination. Given that one training run takes you 15 hours, the search could be very long, but you will find better parameters that you can then train on.
Above all I think your model is doing okay.
If your model underfits, its performance will be lower, just as in the case of overfitting, because it cannot learn effectively enough to reach the optimal result, i.e. the proper function to fit the given distribution. In that case you have to use less regularization, e.g. less dropout, to get the optimal result.
Furthermore, the sampling can also be crucial, because there can be training/validation splits where your model performs well on the validation set and worse on the training set, and vice versa. This is one of the reasons why we use cross-validation and different sampling methods, e.g. stratified k-fold.
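As a small illustration of stratified k-fold, the sketch below splits an imbalanced toy label vector and counts the positives in each validation fold. The 90/10 class ratio and the fold count are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels (an assumption): 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many positives land in each validation fold.
fold_positive_counts = [
    int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)
]
print(fold_positive_counts)
```

Each validation fold keeps the class ratio of the full data set, so no fold ends up without positives, which a plain random split cannot guarantee.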

How to select the best fit machine learning algorithm

When I run machine learning algorithms repeatedly, the accuracy keeps changing. In that case, how do I select the best-fit algorithm for that particular data set?
You should definitely provide more details. It's impossible to suggest anything without the domain, the model architecture, hyperparameters.
I guess you are concerned about the accuracy of the model changing. I think you should set seeds for the random number generators so that the accuracy doesn't change between training runs and you can reproduce your results.
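import random
import numpy
import tensorflow as tf  # only needed if using tensorflow
numpy.random.seed(1)
random.seed(1)
tf.random.set_random_seed(1)  # TensorFlow 1.x; in 2.x use tf.random.set_seed(1)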
Let's assume the question is about the same training data set X, where each run measures accuracy by comparing the predicted responses against the dependent values (Y) of the test data.
If the accuracy keeps changing every time we run the model, the issue is likely sampling bias: how the data gets divided into training and test sets differs from run to run.
When you use the train_test_split function, set its random_state parameter so that the split is reproducible, and make sure the test data stays representative of the overall population of data.
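A minimal sketch of the effect of random_state, using toy arrays that are an assumption for illustration: two calls with the same random_state produce the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption): 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same random_state -> identical train/test partition on every call.
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same_split)
```

Note that fixing random_state only makes the split repeatable. It does not by itself make the test set representative; for classification, the stratify argument of train_test_split can help with that.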

Why does more features in a random forest decrease accuracy dramatically?

I am using sklearn's random forests module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features tend to make the model less accurate.
I suspect that splitting might only be done across one dimension which means that features which are actually more important get less attention when building trees. Could this be the reason?
Yes, the additional features you have added might not have good predictive power, and since a random forest considers only a random subset of the features when building its trees, the original 50 features might get passed over more often. To test this hypothesis, you can plot the variable importances using sklearn.
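A rough sketch of checking variable importances, assuming a regression task and synthetic data with 5 informative and 20 pure-noise features (both assumptions; the question does not say what the features are).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption): columns 0-4 drive the target,
# columns 5-24 are noise unrelated to it.
rng = np.random.RandomState(0)
informative = rng.normal(size=(500, 5))
noise = rng.normal(size=(500, 20))
X = np.hstack([informative, noise])
y = informative.sum(axis=1) + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = forest.feature_importances_

# Print features from most to least important.
for idx in np.argsort(importances)[::-1][:10]:
    print("feature", idx, "importance", round(float(importances[idx]), 4))
```

If the added features all sit at the bottom of this ranking, that supports the hypothesis that they carry little predictive power. The values can be passed to a bar plot if you prefer a picture.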
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c
There are plenty of illustrations of overfitting; for instance, this 2D plot represents the different functions that could be learned for a binary classification task. Because the function on the right has too many parameters, it learns wrong patterns in the data that don't generalize properly.

Confusion Matrix - Not changing with predictive models (Sklearn)

I have 3 predictive models and I am evaluating their performance with a confusion matrix.
I am getting the same results for the confusion matrix for each of the 3 models.
I expect that the different models would perform differently and produce different confusion matrices. I am new to predictive modelling, so I suspect I am making a "rookie mistake". The full script I am using is sitting in a Jupyter notebook on GitHub here
A screenshot of the code for the 3 models is below
Can someone point out what is going wrong?
Cheers
Mike
As mentioned: make predictions on the test data. But keep in mind that your targets are skewed! So use StratifiedKFold or something similar.
Also, I suspect your data may be corrupted in some way. If all models show the same result, there may be a serious mistake underneath.
A few questions and pieces of advice:
1. Did you scale your data?
2. Did you use one-hot encoding?
3. Don't use Decision Trees; use Forests/XGBoost instead. It is easy to overfit with a single decision tree.
4. Don't use more than 2 hidden layers in the NN, because that is easy to overfit too. Start with 2. Also, your architecture (30, 30, 30) with 2 target classes seems weird.
5. If you do want more than 2 hidden layers, go to Keras or TF. You'll find many features there that help against overfitting.
That is simply because you are using the same Training data to make predictions. Since your models are already trained on the same data that you are making the predictions on, they will return the same results (and ultimately the same confusion matrix). You need to split your dataset into training and test sets. Then train your classifier on training set and make predictions on test set.
You can use train_test_split in sklearn to split your dataset into training and test sets.
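A minimal sketch of that workflow, with a synthetic imbalanced data set and two stand-in classifiers (both are assumptions; the models in the screenshot are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, skewed binary data (an assumption): roughly 80% / 20%.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

matrices = {}
for name, model in models.items():
    model.fit(X_train, y_train)            # train on the training part only
    predictions = model.predict(X_test)    # predict on unseen data
    matrices[name] = confusion_matrix(y_test, predictions)
    print(name)
    print(matrices[name])
```

Evaluated this way, each confusion matrix reflects how that model handles data it has not seen, which is where different models are expected to differ.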

How to consistently standardize sparse feature matrix in scikit-learn?

I am using sklearn's DictVectorizer to construct a large, sparse feature matrix, which is fed to an ElasticNet model. Elastic net (and similar linear models) work best when predictors (columns in the feature matrix) are centered and scaled. The recommended approach is to build a Pipeline that uses a StandardScaler prior to the regressor, however that doesn't work with sparse features, as stated in the docs.
I thought of using the normalize=True flag in ElasticNet, which seems to support sparse data; however, it's not clear whether the normalization is applied during prediction to the test data as well. Does anyone know if normalize=True applies for prediction as well? If not, is there a way to use the same standardization on the training and test set when dealing with sparse features?
Digging through the sklearn code, it looks like when fit_intercept=True and normalize=True, the coefficients estimated on the normalized data are projected back to the original scale of the data. This is similar to the way glmnet in R handles standardization. The relevant code snippet is the method _set_intercept of LinearModel, see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L158. So predictions on unseen data use coefficients in the original scale, i.e., normalize=True is safe to use.
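A different route from the normalize flag discussed above, sketched here as an alternative rather than as what the answer describes: StandardScaler accepts sparse input when with_mean=False, so it scales each column without centering and keeps the matrix sparse. Inside a Pipeline, the scaling learned on the training data is reused at prediction time. The records and the alpha value are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy records (an assumption, not from the question).
train_records = [
    {"city": "A", "rooms": 2},
    {"city": "B", "rooms": 4},
    {"city": "A", "rooms": 3},
    {"city": "C", "rooms": 5},
]
train_target = [100.0, 210.0, 150.0, 260.0]
test_records = [{"city": "B", "rooms": 3}]

model = make_pipeline(
    DictVectorizer(sparse=True),        # dicts -> sparse feature matrix
    StandardScaler(with_mean=False),    # scale only; centering would densify
    ElasticNet(alpha=0.1),
)
model.fit(train_records, train_target)

# The scaler statistics fitted on the training data are applied here.
predictions = model.predict(test_records)
print(predictions)
```

The trade-off is that the columns are not centered, which is the price of keeping the matrix sparse. MaxAbsScaler is another scaler in sklearn.preprocessing that works on sparse data.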

Resources