When I use XGBoost to fit a model, it usually shows a list of messages like "updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=5". I wonder how XGBoost performs this tree pruning? I cannot find a description of the pruning process in their paper.
Note: I do understand the decision tree pruning process, e.g. pre-pruning and post-pruning. Here I am curious about the actual pruning process of XGBoost. Usually pruning requires validation data, but XGBoost performs the pruning even when I do not give it any validation data.
XGBoost grows all trees to max_depth first.
This allows for fast training as you don't have to evaluate all the regularization parameters at each node.
After each tree is grown to max_depth, you walk from the bottom of the tree (recursively all the way to the top) and determine whether the split and its children are valid based on the hyper-parameters you selected. If the split or nodes are not valid, they are removed from the tree.
In the model dump of an XGBoost model, you can observe that the actual depth is less than the max_depth used during training if pruning has occurred.
Pruning requires no validation data. It is only asking a simple question as to whether the split, or resulting child nodes are valid, based on the hyper-parameters you have set during training.
More information on pruning is in the original XGBoost slides from 2014.
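To make this concrete, here is a minimal sketch (synthetic data, illustrative parameter values): trees are first grown to max_depth, then splits whose gain falls below gamma are pruned away, which you can verify in the model dump, all without any validation data.

```python
# Minimal sketch: grow trees to max_depth, then let gamma prune splits whose
# loss reduction is below the threshold. Data and parameter values are
# illustrative only.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "max_depth": 5,   # trees are first grown to this depth
    "gamma": 1.0,     # splits with gain below gamma are pruned afterwards
    "verbosity": 2,   # depending on the version, higher verbosity surfaces the "tree pruning end" messages
}
booster = xgb.train(params, dtrain, num_boost_round=10)

# Dump the trees with statistics; pruned trees can end up shallower than max_depth.
for i, tree in enumerate(booster.get_dump(with_stats=True)[:2]):
    print(f"--- tree {i} ---")
    print(tree)
```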
In a random forest regressor from scikit-learn it is possible to set a ccp_alpha parameter that is related to cost-complexity pruning (docs), and I'm using it to control my overfitting.
After applying it, I would like to use this pruned model to perform hyperparameter tuning with random search and find my best model. So, I want this pruned model.
Is it possible to get this pruned model?
When you apply the .fit(X_train, y_train) function to an object of the RandomForestClassifier() or RandomForestRegressor() class, the returned fitted model has already been pruned.
This happens under the hood in the sklearn implementation. Theoretically speaking too, a random forest is not just a combination of decision trees: it is the pruned, aggregated, and (with default settings) bootstrapped version of multiple large decision trees.
Rest assured, thanks to pruning the model returned here is not overfitting. If you do notice overfitting, I'd suggest checking the out-of-bag (OOB) score of your model and describing your entire data pipeline for further suggestions.
Refer to this documentation from scikit-learn
https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
It includes a detailed explanation of implementing pruning using cost complexity.
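If it helps, here is a minimal sketch (synthetic data, illustrative parameter ranges): since pruning happens inside fit() whenever ccp_alpha > 0, you can simply include ccp_alpha in the random search, and best_estimator_ is the fitted (and pruned) model you are after.

```python
# Minimal sketch: tune ccp_alpha alongside other hyper-parameters; the best
# estimator returned by the search is already fitted, i.e. already pruned.
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=500, n_features=20, noise=0.5, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "ccp_alpha": uniform(0.0, 0.05),    # pruning strength
        "n_estimators": randint(100, 500),
        "max_depth": randint(3, 20),
    },
    n_iter=20,
    cv=5,
    random_state=0,
)
search.fit(X, y)

pruned_model = search.best_estimator_   # fitted (and pruned) on X, y
```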
I'm using CatBoostClassifier's eval_metrics to compute some metrics on my test set, and I'm confused about its output. For a given metric, by default, it seems to return an array whose size equals the number of iterations.
This seems to be inconsistent with the predict function, which returns a single value only. Which number in the array returned by eval_metrics is consistent with the predict function?
I checked the documentation at https://catboost.ai/docs/concepts/python-reference_catboostclassifier_eval-metrics.html#python-reference_catboostclassifier_eval-metrics__output-format, but it's still not clear to me.
The CatBoost classifier is an ensemble classifier that uses boosting. Simply put, boosting algorithms iteratively train weak learners (decision trees in this case) to make predictions: each new tree learns from the collective errors made by the previous trees and tries to correct them. CatBoost is based on gradient boosting, which I won't delve too deeply into. What is relevant here is that a number of weak trees are generated in the process, and when you call the eval_metrics() method you get the evaluation metric at each of those tree counts. You specify the number of trees via iterations, num_boost_round, n_estimators or num_trees when creating the model (if not specified, it defaults to 1000).
The other arguments you pass to eval_metrics() define which tree counts are evaluated, from ntree_start to ntree_end at intervals of eval_period. If these aren't provided, you get the specified metrics for every ensemble size up to the full number of trees, which is why you get a list of values. The last value corresponds to the full ensemble, and that is the one consistent with predict().
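A minimal sketch (synthetic data) of what this looks like in practice; the last entry in the returned list is the value for the full model, which is what predict() uses by default.

```python
# Minimal sketch: eval_metrics() returns one value per evaluated tree count;
# the last entry uses all trees, matching the default behaviour of predict().
import numpy as np
from catboost import CatBoostClassifier, Pool

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] > 0).astype(int)

model = CatBoostClassifier(iterations=200, verbose=False)
model.fit(X, y)

pool = Pool(X, y)
scores = model.eval_metrics(pool, metrics=["Logloss"])  # ntree_start/ntree_end/eval_period default to the full range

print(len(scores["Logloss"]))   # == iterations: one value per ensemble size
print(scores["Logloss"][-1])    # value for the full ensemble, i.e. what predict() uses
```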
I am new to decision trees. I am planning to build a large decision tree that I would like to update later with additional data. What is the best approach to this? Can any decision tree be later updated?
Decision trees are most often trained on all available data. That is, when you have new data, you retrain the entire tree. Since this process is very fast it is usually not problematic. If data is too big to fit in memory, you can often get around it by subsampling (row sampling) the training set, since tree-based models don't need that much data to give good results.
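A minimal sketch (illustrative data) of that retraining approach: the "update" is simply a refit on the old data plus the new data.

```python
# Minimal sketch: the common way to "update" a decision tree is to refit it
# on the old data concatenated with the new data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_old, y_old = make_classification(n_samples=5000, n_features=20, random_state=0)
X_new, y_new = make_classification(n_samples=500, n_features=20, random_state=1)

X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])

tree = DecisionTreeClassifier(max_depth=8, random_state=0)
tree.fit(X_all, y_all)   # refitting is fast, so full retraining is rarely a problem
```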
Note that decision trees are quite vulnerable to overfitting, and you should consider Random Forest or another ensemble method. With bagging it is possible to train different trees on different subsets of data.
There also exist incremental and online learning methods for decision trees; the VFDT (Hoeffding tree) learner and incremental variants of CART and ID3 are some examples. A sketch of the online approach follows below.
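One way to experiment with an online, VFDT-style tree in Python is the HoeffdingTreeClassifier from the third-party river library; this is only a sketch and assumes that library is installed.

```python
# Sketch of an online (VFDT-style) tree using the `river` library (assumed
# installed): the model learns one sample at a time instead of being refit.
from river import tree

model = tree.HoeffdingTreeClassifier()

# Toy stream of (features, label) pairs; replace with your own data source.
stream = [
    ({"x1": 0.2, "x2": 1.5}, 0),
    ({"x1": 0.9, "x2": -0.3}, 1),
    ({"x1": 0.1, "x2": 2.0}, 0),
]

for x, y in stream:
    print(model.predict_one(x), "(true:", y, ")")  # predict before learning (prequential evaluation)
    model.learn_one(x, y)                          # then update the tree with the new sample
```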
See gaenari, a C++ incremental decision tree.
It continuously inserts new chunks of data and updates the tree, and a rebuild can update the model when accuracy decreases (concept drift).
I don't understand why I have a feature named BIAS in the contributing features.
I read the doc and I find
" In each column there are features and their weights. Intercept
(bias) feature is shown as in the same table "
But I don't understand what intercepting bias mean here.
Thank you for your help :)
This is related to the way ELI5 computes the weights.
XGBoost outputs scores only for leaves (you can see it via booster.dump_model(…, with_stats=True)), so the XGBoost explainer implementation in ELI5 reconstructs pseudo leaf scores for every node across all the trees. These pseudo leaf scores are basically the average leaf score you would expect if the tree stopped at that node, i.e. the average of all child leaves weighted by their cover in the training set.
This also applies to the root nodes of the trees, which are similarly assigned pseudo leaf scores. At the root node level, this score is the average score you would end up with when going through the tree. Summed across all the trees, the root node scores give the average score you would get going through all the trees (the value that a sigmoid is applied to in order to obtain a probability). This is what ELI5 puts into <BIAS>.
So you can understand <BIAS> as the expected average score output by the model, based on the distribution of the training set.
The <BIAS> will change if you modify your base_score parameter (for instance in the case of an imbalanced binary classification, you may change the default 0.5 to something closer to your target rate, and the <BIAS> should get closer to 0).
EDIT: maybe it's clearer with the visual explanation from this blog (baseline is equivalent to <BIAS>) https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
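For a concrete view, here is a small sketch (synthetic data, assuming an ELI5 version compatible with your XGBoost install) that prints a per-prediction explanation in which <BIAS> shows up as one of the contributing features:

```python
# Minimal sketch: <BIAS> appears among the "features" in ELI5's explanation
# of a single XGBoost prediction; its weight is the expected average score
# described above.
import numpy as np
import eli5
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(X, y)

# Explain one prediction and render it as text; the table lists <BIAS>
# alongside the actual feature contributions.
explanation = eli5.explain_prediction(clf, X[0])
print(eli5.format_as_text(explanation))
```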
It would be good to get some tips on tuning Apache Spark for Random Forest classification.
Currently, we have a model that looks like:
featureSubsetStrategy: all
impurity: gini
maxBins: 32
maxDepth: 11
numberOfClasses: 2
numberOfTrees: 100
We are running Spark 1.5.1 as a standalone cluster.
1 Master and 2 Worker nodes.
The amount of RAM is 32GB on each node with 4 Cores.
The classification takes 440ms.
When we increase the number of trees to 500, it already takes 8 seconds.
We tried reducing the depth, but then the error rate is higher. We have around 246 attributes.
We are probably doing something wrong. Any ideas on how we could improve the performance?
Increasing the number of decision trees will definitely increase the prediction time, since each instance has to traverse all the trees, but reducing it is no good for prediction accuracy. You have to vary this parameter (the number of decision trees) and find an optimal value; that is why it is called a hyper-parameter. Hyper-parameters depend heavily on the nature of your data and attributes, so you may need to vary other hyper-parameters as well, one by one, to reach a global optimum.
Also, when you say prediction time, are you including the time to load the model as well? If so, model loading time should not be counted as prediction time; it is only an overhead for loading your model and preparing the application for prediction.
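As a sketch of how you could measure this separately with the Spark 1.5-era RDD API (toy data standing in for your 246-attribute dataset; parameter values mirror your configuration):

```python
# Sketch against the old pyspark.mllib RDD API: time only the prediction step,
# separately from training / model loading.
import time
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

sc = SparkContext(appName="rf-tuning")

# Toy data; replace with your real dataset.
data = sc.parallelize(
    [LabeledPoint(i % 2, [float(i), float(i % 7)]) for i in range(1000)]
)
train, test = data.randomSplit([0.8, 0.2], seed=42)

model = RandomForest.trainClassifier(
    train,
    numClasses=2,
    categoricalFeaturesInfo={},
    numTrees=100,                 # vary this (e.g. 100 vs 500) and time predictions
    featureSubsetStrategy="all",
    impurity="gini",
    maxDepth=11,
    maxBins=32,
)

features = test.map(lambda lp: lp.features)
start = time.time()
predictions = model.predict(features).collect()   # force evaluation
print("prediction time: %.3f s" % (time.time() - start))

sc.stop()
```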