Change depth of random forest ex-post? - scikit-learn

I have been using the scikit-learn RandomForestClassifier for a project. I am wondering if there's any tool or trick to grow a "full" forest, and then experiment with certain hyperparameters ex-post. For example, is there a way to quickly chop every fully grown tree to a depth of, say, 10 to check its performance on a test set? I imagine that this could be done in a computationally feasible manner for any hyperparameter that limits tree depth (e.g. max_depth, min_samples_leaf, min_samples_split).
I currently use GridSearchCV to find the best configuration of max_depth, max_features and max_samples, but with three hyperparameters, it takes a long time to search the space.
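One way to experiment with this, sketched below under the assumption that only depth is being truncated: a fitted sklearn tree exposes its structure through `tree_.children_left`, `tree_.children_right`, `tree_.feature`, and `tree_.threshold`, so you can emulate chopping every tree at a given depth by simply stopping the traversal early. The helper names `predict_tree_truncated` and `predict_forest_truncated` are hypothetical, and note that this votes with hard class predictions, which can differ slightly from `RandomForestClassifier.predict`, which averages predicted probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def predict_tree_truncated(est, X, max_depth):
    """Predict class indices with one fitted tree, treating any node
    reached at depth `max_depth` as if it were a leaf."""
    t = est.tree_
    out = np.empty(X.shape[0], dtype=np.intp)
    for i, x in enumerate(X):
        node, depth = 0, 0
        # children_left == -1 marks a real leaf in sklearn's tree arrays
        while t.children_left[node] != -1 and depth < max_depth:
            if x[t.feature[node]] <= t.threshold[node]:
                node = t.children_left[node]
            else:
                node = t.children_right[node]
            depth += 1
        # majority class among training samples at the stopping node
        out[i] = np.argmax(t.value[node])
    return out

def predict_forest_truncated(forest, X, max_depth):
    """Majority vote over all depth-truncated trees in the forest."""
    votes = np.stack([predict_tree_truncated(est, X, max_depth)
                      for est in forest.estimators_])
    maj = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    return forest.classes_[maj]

# Grow one unrestricted forest, then evaluate several depths cheaply
# without refitting anything.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
for d in (1, 3, 10):
    acc = (predict_forest_truncated(clf, X, max_depth=d) == y).mean()
```

The same idea does not transfer as directly to min_samples_leaf or min_samples_split, since those change which splits are made in the first place rather than just where the traversal stops.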

Related

XGBOOST faster than random forest?

I am doing a Kaggle in-class challenge on the Boston housing prices dataset and learned that XGBoost is faster than RandomForest, but when I implemented it, XGBoost was slower. I want to ask: when does XGBoost become faster, and when does RandomForest? I am new to machine learning and need your help. Thanks in advance.
Mainly, the parameters you choose have a strong impact on the speed of your algorithm (e.g. learning rate, depth of the trees, number of features, etc.). There is a trade-off between accuracy and speed, so I suggest you list the parameters you have chosen for each model and see how to change them to get faster performance with reasonable accuracy.
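To make the trade-off concrete, here is a rough timing sketch using sklearn's GradientBoostingClassifier as a stand-in for XGBoost (both are gradient boosting; absolute timings will vary by machine and by library). The point is only that fit time depends heavily on the chosen parameters, not which library "wins":

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def timed_fit(model):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    t0 = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - t0

# Boosting typically uses many shallow trees; a forest grows deep ones.
t_gb = timed_fit(GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                            random_state=0))
t_rf = timed_fit(RandomForestClassifier(n_estimators=100, random_state=0))
print(f"boosting: {t_gb:.2f}s, forest: {t_rf:.2f}s")
```

Re-running this with a larger max_depth for the booster, or fewer estimators for the forest, will shift the comparison, which is exactly why the parameters you picked matter.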

Why does more features in a random forest decrease accuracy dramatically?

I am using sklearn's random forests module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features tend to make the model less accurate.
I suspect that splitting might only be done across one dimension which means that features which are actually more important get less attention when building trees. Could this be the reason?
Yes, the additional features you added might not have good predictive power, and since a random forest takes a random subset of features to build each individual tree, the original 50 features might get missed at many splits. To test this hypothesis, you can plot variable importance using sklearn.
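A quick sketch of that test, under the assumption (made up for illustration) that the extra 100 columns are pure noise: pad the data, compare cross-validated accuracy, and inspect `feature_importances_` to see whether the forest is spending splits on the noise columns:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 50 original features, then pad with 100 pure-noise columns.
X, y = make_classification(n_samples=500, n_features=50, n_informative=20,
                           random_state=0)
rng = np.random.default_rng(0)
X_padded = np.hstack([X, rng.normal(size=(500, 100))])

rf = RandomForestClassifier(n_estimators=100, random_state=0)
acc_orig = cross_val_score(rf, X, y, cv=5).mean()
acc_padded = cross_val_score(rf, X_padded, y, cv=5).mean()

# Which columns does the forest actually use?
rf.fit(X_padded, y)
importances = rf.feature_importances_  # normalized to sum to 1
top10 = np.argsort(importances)[::-1][:10]
```

If most of the top-ranked indices fall in the first 50 columns but accuracy still dropped, that supports the dilution explanation; increasing max_features is one way to give the real features more chances per split.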
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c
There are plenty of illustrations of overfitting, but for instance, this 2d plot represents the different functions that would have been learned for a binary classification task. Because the function on the right has too many parameters, it learns wrong patterns in the data that don't generalize properly.

Decide max_depth of DecisionTreeClassifier in sklearn

I have a question about tuning a DecisionTreeClassifier with GridSearchCV in sklearn. When I decide on a range of max_depth, I think the required max_depth differs case by case, because the number of samples and features affects the appropriate depth. So, is there any appropriate criterion for deciding the range of max_depth, or is it only decided by intuition?
You can try changing max_depth from case to case and record the performance, for example with a metric such as log loss:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
You can then decide on a max depth based on those tests.
However, if you want max_depth to adapt to the data, you can try to train another learning algorithm with enough data to find it out (or simply a linear regression).
Typically the recommendation is to start with max_depth=3 and then work up from there, which the Decision Tree (DT) documentation covers in more depth.
Using ensemble methods such as RandomForestClassifier, or DT regression, can also help in determining whether max_depth is set too high and/or the model is overfitting.
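Rather than picking one depth by intuition, the advice above can be sketched as a cross-validated sweep over a generous range (the exact range below is an arbitrary choice for illustration; None means unrestricted depth):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Sweep a wide range of depths and let cross-validation pick.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 5, 8, 12, None]},
    cv=5,
)
grid.fit(X, y)

best_depth = grid.best_params_["max_depth"]
scores = grid.cv_results_["mean_test_score"]  # one mean score per depth
```

If the best score sits at the edge of the range, widen the range and rerun; if scores plateau early, the dataset does not need deep trees and a small range suffices.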

Awkward results from sklearn.metrics

When doing decision tree regression using the default parameters, I got an R2 value of -1.3. What does it mean? Is my model OK? The mean squared error is also not reasonable. Can I make R2 positive by changing the parameters of the regressor?
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
from sklearn.metrics import r2_score, mean_squared_error
A negative R2 means the model performs worse on that data than simply predicting the mean of the targets; for an untuned decision tree fit to small or noisy training data, this is typically a symptom of over-fitting.
You could address this by tuning the parameters of the Decision tree using, e.g. a grid search – setting max_depth to a smaller value will probably cause the model to perform better in your case.
An even better approach would be to change to a Random Forest model, which uses ensembles of decision trees to more automatically correct for such over-fitting (though tuning via grid search is still important to further optimize the results).
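As a rough sketch of the failure mode described above (synthetic data and variable names are made up for illustration): an unrestricted DecisionTreeRegressor memorizes the training set, scoring a train R2 of exactly 1.0, while its test R2 can be far lower, and capping max_depth is the usual first fix:

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Small, noisy data: easy for a deep tree to memorize.
X, y = make_regression(n_samples=80, n_features=5, noise=50.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

r2_train_deep = r2_score(y_tr, deep.predict(X_tr))  # 1.0: memorized
r2_test_deep = r2_score(y_te, deep.predict(X_te))   # much worse, can go negative
r2_test_shallow = r2_score(y_te, shallow.predict(X_te))
```

The gap between r2_train_deep and r2_test_deep is the over-fitting the answer describes; a RandomForestRegressor on the same split would typically narrow it without hand-tuning the depth.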

RandomForestClassifier vs ExtraTreesClassifier in scikit learn

Can anyone explain the difference between the RandomForestClassifier and ExtraTreesClassifier in scikit learn. I've spent a good bit of time reading the paper:
P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006
It seems these are the difference for ET:
1) When choosing variables at a split, samples are drawn from the entire training set instead of a bootstrap sample of the training set.
2) Splits are chosen completely at random from the range of values in the sample at each split.
The result of these two things is many more "leaves".
Yes both conclusions are correct, although the Random Forest implementation in scikit-learn makes it possible to enable or disable the bootstrap resampling.
In practice, RFs are often more compact than ETs. ETs are generally cheaper to train from a computational point of view but can grow much bigger. ETs can sometimes generalize better than RFs, but it's hard to guess when that is the case without trying both first (and tuning n_estimators, max_features and min_samples_split by cross-validated grid search).
The ExtraTrees classifier always tests random splits over a fraction of the features (in contrast to RandomForest, which tests all possible splits over a fraction of the features).
The main difference between random forests and extra trees (short for extremely randomized trees) lies in the fact that, instead of computing the locally optimal feature/split combination (for the random forest), for each feature under consideration a random value is selected for the split (for the extra trees). Here is a good resource covering their differences in more detail: Random forest vs extra tree.
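The "try both first" advice is cheap to act on, since the two classifiers share an API. A minimal sketch (synthetic data for illustration; note that ExtraTreesClassifier defaults to bootstrap=False, matching point 1 above) that also compares total model size via node counts:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
et = ExtraTreesClassifier(n_estimators=50, random_state=0)  # bootstrap=False by default

acc_rf = cross_val_score(rf, X, y, cv=5).mean()
acc_et = cross_val_score(et, X, y, cv=5).mean()

# Compare total model size (tree node counts) after fitting on all data.
rf.fit(X, y)
et.fit(X, y)
nodes_rf = sum(t.tree_.node_count for t in rf.estimators_)
nodes_et = sum(t.tree_.node_count for t in et.estimators_)
```

On most datasets the ET ensemble ends up with more nodes, consistent with the "many more leaves" observation, while which one scores better varies, which is why cross-validating both is the recommendation.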
