Finding the variables that contribute the most to a decision tree prediction in H2O - decision-tree

How do I find the variables that contribute the most to a particular prediction from a decision tree? For example, if there are features A, B, C, D, E and we build a decision tree on the dataset, then for a sample x, let's say variables C and D contribute the most to prediction(x). How can I find the variables that contributed the most to prediction(x) in H2O? I know H2O gives the global importance of variables once the decision tree is built. My question is about the case where we use that particular tree to make a decision and want to find the variables that contributed to that particular decision. Scikit-learn has functions to extract the rules that were used to predict a sample. Does H2O have any such functionality?

H2O has no support for doing this currently (as of Feb 2017, h2o 3.10.3.x); Erin opened a JIRA for it: https://0xdata.atlassian.net/browse/PUBDEV-4007
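The question also mentions scikit-learn's ability to extract the rules used for a single prediction. For reference only (this is scikit-learn, not H2O), a minimal sketch of that idea using decision_path, with the iris data standing in for the A-E features in the question:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

sample = X[:1]                               # the single sample of interest
node_indicator = tree.decision_path(sample)  # nodes visited by this sample
visited_nodes = node_indicator.indices       # CSR format: ids of visited nodes

# Collect the features tested along the path; leaf nodes are marked with -2.
used_features = {
    tree.tree_.feature[node]
    for node in visited_nodes
    if tree.tree_.feature[node] >= 0
}
print("features used for this prediction:", sorted(used_features))
```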

Related

Incremental learning - Set Initial Weights or values for Parameters from previous model for ML algorithm in Spark 2.0

I am trying to set the initial weights or parameters for a machine learning (classification) algorithm in Spark 2.x. Unfortunately, except for the MultiLayerPerceptron algorithm, no other algorithm provides a way to set the initial weight/parameter values.
I am trying to solve incremental learning using Spark. Here, I need to load an old model and re-train it with the new data in the system. How can I do this?
How can I do this for other algorithms like:
Decision Trees
Random Forest
SVM
Logistic Regression
I need to experiment with multiple algorithms and then choose the best performing one.
How can I do this for other algorithms like:
Decision Trees
Random Forest
You cannot. Tree-based algorithms are not well suited for incremental learning, as they look at the global properties of the data and have no "initial weights or values" that can be used to bootstrap the process.
Logistic Regression
You can use StreamingLogisticRegressionWithSGD, which implements exactly the required process, including setting initial weights with setInitialWeights (see the sketch after this answer).
SVM
In theory it could be implemented similarly to the streaming regressions (StreamingLogisticRegressionWithSGD or StreamingLinearRegressionWithSGD) by extending StreamingLinearAlgorithm, but there is no such implementation built in, and since org.apache.spark.mllib is in maintenance mode, there won't be.
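For the logistic regression case, a minimal PySpark sketch of the warm-start idea. Assumptions for illustration: old_model is a previously trained LogisticRegressionModel whose weights we reuse, and the queueStream below is only a placeholder for a real DStream of LabeledPoint (e.g. from Kafka or a socket).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD

sc = SparkContext(appName="incremental-lr")
ssc = StreamingContext(sc, batchDuration=10)

# Placeholder stream; in practice parse a real source into LabeledPoint records.
training_stream = ssc.queueStream([])

model = StreamingLogisticRegressionWithSGD(stepSize=0.1, numIterations=50)
# Start from the old model's weights instead of zeros, so new batches
# refine the existing fit rather than restarting from scratch.
model.setInitialWeights(old_model.weights)   # old_model: assumed existing model

model.trainOn(training_stream)               # keeps updating as batches arrive
ssc.start()
ssc.awaitTermination()
```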
It's not based on Spark, but there is a C++ incremental decision tree: see gaenari.
Continuous chunks of data can be inserted and updated, and a rebuild can be run if concept drift reduces accuracy.

What algorithm is used in spark decision tree (is ID3, C4.5 or CART)

I have a question about decision tree in MLlib. What algorithm is used in Spark? Is it ID3, C4.5 or CART?
Spark MLlib is using CART.
ID3 only handles categorical variables, while CART can also handle continuous variables. Spark decision trees can handle continuous variables, so they are using CART (in the Jira ticket specified below we can see that they haven't implemented C4.5 yet).
In this blog post you can find some information about the different algorithms and it is where I got the answer from.
You can find a discussion on extending it to C4.5 in this Jira ticket.
More information about the difference between the algorithms here.
If you take a look at the Apache Spark link, in the section "Node impurity and information gain (Basic Algorithm)", you can find:
The current implementation provides two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance)
Also, if you take a look at the Decision Tree link, you can find that the CART (classification and regression tree) algorithm uses Gini impurity and entropy for classification and variance reduction for regression.
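As a concrete illustration of where those impurity measures plug in, here is a small PySpark MLlib sketch (assuming labeled_points is an existing RDD of LabeledPoint):

```python
from pyspark.mllib.tree import DecisionTree

# Classification: pick one of the two impurity measures, "gini" or "entropy".
clf_model = DecisionTree.trainClassifier(
    labeled_points,                 # RDD[LabeledPoint], assumed to exist
    numClasses=2,
    categoricalFeaturesInfo={},     # empty dict = treat all features as continuous
    impurity="gini",
    maxDepth=5,
)

# Regression: "variance" is the only impurity measure available.
reg_model = DecisionTree.trainRegressor(
    labeled_points,
    categoricalFeaturesInfo={},
    impurity="variance",
    maxDepth=5,
)
```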

How train a classifier on different feature types together? Like String,numeric,Categorical, timestamp etc

I am a newbie in the field of machine learning. I have taken Udacity's "Introduction to Machine Learning" course, so I know how to run basic classifiers using sklearn and Python. But all the classifiers they taught in the course were trained on a single data type.
I have a problem wherein I want to classify a code commit as "clean" or "buggy".
I have a feature set which contains string data (like the name of a person), categorical data (say "clean" vs "buggy"), numeric data (like the number of commits) and timestamp data (like the time of commit). How can I train a classifier on these features simultaneously? Let's assume that I plan on using a Naive Bayes classifier and sklearn. Please help!
I am trying to implement the paper. Any help would really be appreciated.
Many machine learning classifiers like logistic regression, random forest, decision trees and SVM work fine with both continuous and categorical features. My guess is that you have two paths to follow. The first one is data pre-processing: for example, convert all string/categorical data (the name of a person) to integers; a sketch of this path follows below. The other is to use ensemble learning.
Ensemble learning is when you combine different classifiers (each one dealing with one kind of heterogeneous feature) using majority vote, for example, so they can find a consensus in classification. Hope it helps.
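A minimal sketch of the pre-processing path, using pandas and scikit-learn; the column names and toy values are made up for illustration, and GaussianNB stands in for whichever Naive Bayes variant ends up fitting the data best.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Toy commit data; the column names are hypothetical.
df = pd.DataFrame({
    "author": ["alice", "bob", "alice", "carol"],            # string
    "num_commits": [12, 3, 7, 1],                            # numeric
    "commit_time": pd.to_datetime(
        ["2017-01-02 10:00", "2017-01-03 23:30",
         "2017-01-05 09:15", "2017-01-06 02:45"]),           # timestamp
    "label": ["clean", "buggy", "clean", "buggy"],           # target
})

# Turn the timestamp into numeric features the classifier can use.
df["hour"] = df["commit_time"].dt.hour
df["dayofweek"] = df["commit_time"].dt.dayofweek

# One-hot encode the string column so every feature is numeric.
X = pd.get_dummies(df[["author", "num_commits", "hour", "dayofweek"]],
                   columns=["author"])
y = (df["label"] == "buggy").astype(int)

clf = GaussianNB().fit(X, y)
print(clf.predict(X))
```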

Azure Machine Learning Decision Tree output

Is there any way to get the output of the Boosted Decision Tree module in ML Studio? To analyze the learned tree, like in Weka.
Update: visualization of decision trees is available now! Right-click on the output node of the "Train Model" module and select "Visualize".
My old answer:
I'm sorry; visualization of decision trees isn't available yet. (I really want it too! You can upvote this feature request at http://feedback.azure.com/forums/257792-machine-learning/suggestions/7419469-show-variable-importance-after-experiment-runs, but they are currently working on it.)
Just FYI, you can currently see what the model builds for linear algorithms by right-clicking on the "Train Model" module output node and selecting "Visualize". It will show the initial parameter values and the feature weights. But for non-linear algorithms like decision trees, that visibility is still forthcoming.
Yes. I don't know your structure, but you should have your dataset and the algorithm going into a Train Model module, and then feed the results of the Train Model, together with the other half of the dataset (if you used Split), into a Score Model module. You can see the scored label and scored probabilities there when you press Visualize.
Your experiment should look a bit like this: connect the boosted decision tree and the dataset to a Train Model module, and you can see the results in the Score Model module.

How to control feature subsetting in random forest in scikit-learn?

I am trying to change the way the random forest algorithm subsets features for every node. As implemented in scikit-learn, the subsetting is random. I want to define which subset is used for every new node, choosing from several predefined subsets. Is there a direct way in scikit-learn to control this? If not, is there any way to modify the scikit-learn code? If yes, which function in the source code do you think should be updated?
Short version: This is all you.
I assume by "subsetting features for every node" you are referring to the random selection of a subset of samples and possibly features used to train individual trees in the forest. If that's what you mean, then you aren't building a random forest; you want to make a nonrandom forest of particular trees.
One way to do that is to build each DecisionTreeClassifier individually using your carefully specified subset of features, then use the VotingClassifier to combine the trees into a forest. (That feature is only available in 0.17/dev, so you may have to build your own, but it is super simple to build a voting classifier estimator class.)
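A minimal sketch of that idea: each tree is a small pipeline that first selects its hand-picked columns (the subsets below are arbitrary examples) and then fits a DecisionTreeClassifier, and VotingClassifier combines them by majority vote.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each "tree" of the non-random forest only sees the columns assigned to it.
subsets = {
    "tree_a": [0, 1],   # e.g. sepal features
    "tree_b": [2, 3],   # e.g. petal features
    "tree_c": [0, 2],
}

estimators = [
    (name, make_pipeline(
        FunctionTransformer(lambda X, cols=cols: X[:, cols]),  # column selector
        DecisionTreeClassifier(random_state=0)))
    for name, cols in subsets.items()
]

forest = VotingClassifier(estimators=estimators, voting="hard")
forest.fit(X, y)
print(forest.score(X, y))
```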
