When I tuning Decision Tree using GridSearchCV in skelarn, I have a question. When I decide range of max_depth, I think that required max_depth is different case by case. Because, the number of sample, or features affect to decide max_depth. So, Is there any appropriate criteria for decide range of max_depth, or it's only decided by intuition?
You can try to change the max_depth from case to case and record the performance.
This might help you to get the performance.
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
You may decide a max depth with the tests.
However, if you want to make the max_depth adapted from the tree, You can try to train another learning algorithm with enough data to find it out. (Or simply with a linear regression)
Typically the recommendation is to start with max_depth=3 and then working up from there, which the Decision Tree (DT) documentation covers more in-depth.
Specifically using Ensemble Methods such as RandomForestClassifier or DT Regression is also helpful in determining whether or not max_depth is set to high and/or overfitting.
Related
I am trying to lean SVC classifier of SVM model in sklearn. I have learned to use it on various datasets and even applied gridsearch to improve the results but I have not yet understood some parameters like C, gamma.
If anyone can give me simple but detail explanation of each parameter, it would be great.
Since we are trying to minimize some objective function, we can add some 'size' measure of the coefficient vector itself to the function. C is essentially the inverse of the weight on that 'regularization' term. Decreasing C will prevent overfitting by forcing the coefficients to be sparse or small, depending on the penalty. Increasing C too much will promote underfitting.
Gamma is a parameter for the RBF kernel. Increasing gamma allows for a more complex decision boundary (which can lead to overfitting, but can also improve results--it depends on the data).
This scikit-learn tutorial graphically shows the effect of changing both hyperparameters.
I have a data set containing 1000 points each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing purpose. I am training it using sklearn support vector regressor. I have got 100% accuracy with training set but results obtained with test set are not good. I think it may be because of overfitting. Please can you suggest me something to solve the problem.
You may be right: if your model scores very high on the training data, but it does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model under a different situation. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR and create several models and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initiated using several input parameters, each of which could be set to a number of different values. For the simplicity, let's assume you are only dealing with two parameters that you want to tweak: 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
for kernel in ('rbf', 'linear'):
for c in (0.1, 1, 10):
svr = SVR(kernel=kernel, C=c, degree=4)
svr.fit(train_features, train_target)
score = svr.score(test_features, test_target)
print kernel, c, score
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn to do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
parameters = {'kernel':('linear', 'rbf'), 'C':(0.1, 1, 10)}
clf = GridSearchCV(SVC(degree=4), parameters)
clf.fit(train_features, train_target)
print clf.best_score_
print clf.best_params_
model = clf.best_estimator_ # This is your model
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
Try fitting your data using training data split Sklearn K-Fold cross-validation, this provides you a fair split of data and better model , though at a cost of performance , which should really matter for small dataset and where the priority is accuracy.
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by #shahins.
Especially, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it. It corresponds to regularize more the estimation.
If it's taking too long, you can also try RandomizedSearchCV
As a side note from #shahins answer (I am not allowed to add comments), both implementations are not equivalent. GridSearchCV is better since it performs cross-validation in the training set for tuning the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data
I'm using GradientBoostingClassifier for my unbalanced labeled datasets. It seems like class weight doesn't exist as a parameter for this classifier in Sklearn. I see I can use sample_weight when fit but I cannot use it when I deal with VotingClassifier or GridSearch. Could someone help?
Currently there isn't a way to use class_weights for GB in sklearn.
Don't confuse this with sample_weight
Sample Weights change the loss function and your score that you're trying to optimize. This is often used in case of survey data where sampling approaches have gaps.
Class Weights are used to correct class imbalances as a proxy for over \ undersampling. There is no direct way to do that for GB in sklearn (you can do that in Random Forests though)
Very late, but I hope it can be useful for other members.
In the article of Zichen Wang in towardsdatascience.com, the point 5 Gradient Boosting it is told:
For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset.
And a chart shows that the half of the grandient boosting model have an AUROC over 80%. So considering GB models performances and the way they are done, it seems not to be necessary to introduce a kind of class_weight parameter as it is the case for RandomForestClassifier in sklearn package.
In the book Introduction To Machine Learning with Pyhton written by Andreas C. Müller and Sarah Guido, edition 2017, page 89, Chapter 2 *Supervised Learning, section Ensembles of Decision Trees, sub-section Gradient boosted regression trees (gradient boosting machines):
They are generally a bit more sensitive to
parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.
Now if you still have scoring problems due to imbalance proportions of categories in the target variable, it is possible you should see if your data should be splited to apply different models on it, because they are not as homogeneous as it seems to be. I mean it may have a variable you have not in your dataset train (an hidden variable clearly) that influences a lot the model results, then it is difficult even for the greater GB to give correct scoring because it misses a huge information that you cannot make appear in the matrix to compute sometimes for many reasons.
Some updates:
I found, by random, there are libraries that implement it as parameters of their gradient boosting instance objects. It is the case of H2O where for the parameter balance_classes it is told:
Balance training data class counts via over/under-sampling (for
imbalanced data).
Type: bool (default: False).
If you want to keep with sklearn you should do as HakunaMaData told: over/under-sampling because that's what other libraries finally do when the parameter exist.
When doing decisiontree regression using default parameters, I got R2 value "-1.3". What does it means, is my model OK? The mean square error is also NOT reasonable. Can I make it positive by changing the parameters of the classifiers.
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
from sklearn.metrics import r2_score, mean_squared_error
A negative R2 is indicative of over-fitting, which is pretty typical for an untuned Decision tree fit to small or noisy training data.
You could address this by tuning the parameters of the Decision tree using, e.g. a grid search – setting max_depth to a smaller value will probably cause the model to perform better in your case.
An even better approach would be to change to a Random Forest model, which uses ensembles of decision trees to more automatically correct for such over-fitting (though tuning via grid search is still important to further optimize the results).
I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand generally KNN does not require training but since sklearn implements it using KDTrees, the tree must be generated from the training data. However, this sounds like its turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others" it usually means difference in feature two is worth, say, 10x difference in other coords. Simple way to achive this is by multiplying coord #2 by its weight. So you put into the tree not the original coords but coords multiplied by their respective weights.
In case your features are combinations of the coords, you might need to apply appropriate matrix transform on your coords before applying weights, see PCA (principal component analysis). PCA is likely to help you with question 2.
The answer to question to is called "metric learning" and currently not implemented in Scikit-learn. Using the popular Mahalanobis distance amounts to rescaling the data using StandardScaler. Ideally you would want your metric to take into account the labels.