Azure Machine Learning Decision Tree output

Is there any way to get the output of the Boosted Decision Tree module in ML Studio, so I can analyze the learned tree the way I can in Weka?

Update: visualization of decision trees is available now! Right-click on the output node of the "Train Model" module and select "Visualize".
My old answer:
I'm sorry; visualization of decision trees isn't available yet. (I really want it too! You can upvote the feature request at http://feedback.azure.com/forums/257792-machine-learning/suggestions/7419469-show-variable-importance-after-experiment-runs, though they are already working on it.)
Just FYI, you can currently see what the model builds for linear algorithms by right-clicking on the "Train Model" module output node and selecting "Visualize". It will show the initial parameter values and the feature weights. But for non-linear algorithms like decision trees, that visibility is still forthcoming.

Yes. I don't know your exact structure, but your dataset and the algorithm should go into a Train Model module; then feed the result of the Train Model, together with the other half of the dataset (if you used Split Data), into a Score Model module. When you press Visualize on the Score Model output you can see the scored labels and scored probabilities.
Your experiment should look a bit like this: connect the boosted decision tree and the dataset to a Train Model module, and you can see the results in the Score Model module.

Related

Updating a Decision Tree With New Data

I am new to decision trees. I am planning to build a large decision tree that I would like to update later with additional data. What is the best approach to this? Can any decision tree be later updated?
Decision trees are most often trained on all available data: when you have new data, you retrain the entire tree. Since this process is very fast, it is usually not problematic. If the data is too big to fit in memory, you can often get around that by subsampling (row sampling) the training set, since tree-based models don't need that much data to give good results.
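A minimal sketch of the retrain-on-everything approach, assuming NumPy arrays X and y already hold the combined old and new data (the 100,000-row cap and max_depth are arbitrary illustrations):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # assume X, y are NumPy arrays holding all data collected so far (old + new)
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X), size=min(len(X), 100_000), replace=False)  # row subsample

    clf = DecisionTreeClassifier(max_depth=10, random_state=0)
    clf.fit(X[idx], y[idx])  # retrain the whole tree from scratch on the sample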
Note that decision trees are quite vulnerable to overfitting, so you should consider Random Forest or another ensemble method. With bagging it is possible to train different trees on different subsets of the data.
There also exist incremental and online learning methods for decision trees; incremental variants of CART and ID3, and the VFDT (Hoeffding tree) learner, are some examples.
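As an illustration of the online route, a minimal sketch using the third-party river library's Hoeffding tree implementation (the feature names and toy stream are hypothetical):

    from river import tree  # third-party: pip install river

    # VFDT (Hoeffding tree): updates from one example at a time, no full retrain
    model = tree.HoeffdingTreeClassifier()

    stream = [({"x1": 0.2, "x2": 1.0}, "clean"),
              ({"x1": 0.9, "x2": 0.1}, "buggy")]  # toy stream; river takes one dict per example

    for x, y in stream:
        model.learn_one(x, y)  # incremental update

    print(model.predict_one({"x1": 0.5, "x2": 0.5}))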
See gaenari, a C++ incremental decision tree. It continuously inserts new chunks of data and updates the tree, and a rebuild can update the model when accuracy decreases (concept drift).

Confusion Matrix - Not changing with predictive models (Sklearn)

I have 3 predictive models and I am evaluating their performance with a confusion matrix.
I am getting the same results for the confusion matrix for each of the 3 models.
I expect that the different models would perform differently and produce different confusion matrices. I am new to predictive modelling, so I suspect I am making a rookie mistake. The full script I am using is sitting in a Jupyter notebook on GitHub here.
A screenshot of the code for the 3 models is below
Can someone point out what is going wrong?
Cheers
Mike
As mentioned, make predictions on the test data. But keep in mind that your targets are skewed, so use StratifiedKFold or something similar (a sketch follows the list below). Also, I suspect something is wrong with your data; when all models show the same result, there may be a bigger mistake underneath.
A few questions/suggestions:
1. Did you scale your data?
2. Did you use one-hot encoding?
3. Don't use plain Decision Trees; use Forests/XGBoost instead. It is easy to overfit with a DT.
4. Don't use more than 2 hidden layers in a NN, because it is easy to overfit there too; start with 2. And your architecture (30, 30, 30) with 2 target classes seems weird.
5. If you do want more than 2 hidden layers, go to Keras or TF. You'll find many features there that help you avoid overfitting.
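A minimal sketch of the stratified evaluation mentioned above, assuming X and y are already loaded (the DecisionTreeClassifier is a stand-in for any of the three models):

    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import StratifiedKFold, cross_val_predict
    from sklearn.tree import DecisionTreeClassifier

    # each fold keeps the (skewed) class ratio of the full dataset
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    y_pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
    print(confusion_matrix(y, y_pred))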
That is simply because you are using the same training data to make predictions. Since your models were trained on the very data you are making predictions on, they return the same results (and ultimately the same confusion matrix). You need to split your dataset into training and test sets, train your classifier on the training set, and make predictions on the test set.
You can use train_test_split in sklearn to split your dataset into training and test sets.
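A minimal sketch of that fix, assuming X and y are already loaded; stratify=y keeps the class balance in both halves, and the DecisionTreeClassifier is a stand-in for any of the three models:

    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    y_pred = clf.predict(X_test)  # predict on held-out data, not on the training data
    print(confusion_matrix(y_test, y_pred))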

How to see the column that influences the prediction result the most?

I'm using Azure Machine Learning Studio in order to predict a column using Two-Class Boosted Decision Tree and split data.
The diagram that I have assembled can be found here:
What I need is to see which column in the dataset influences the prediction the most; in other words, the column that changes the prediction result more than the other columns do.
Sorry if this has been asked before, but I couldn't find a proper answer to this simple question.
As said before, Permutation Feature Importance does the trick. Attach the Permutation Feature Importance module to the Train Model module, click on the output port, and select Visualize to see the module's results: the list of features sorted in descending order of their permutation importance scores.
One piece of advice: be careful when interpreting permutation scores when you have highly correlated features.
For more info, see:
https://standupdata.com/category/permutation-feature-importance/
https://gallery.cortanaintelligence.com/Experiment/Permutation-Feature-Importance-5
Most ML implementations of decision trees include something called "feature importance" in the model. For example, scikit-learn's DecisionTreeClassifier has a feature_importances_ attribute that indicates the importance of each feature.
The Azure ML implementation should be no exception; please look at Permutation Feature Importance.
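On the scikit-learn side, a minimal sketch of both approaches, using a built-in dataset as a hypothetical stand-in for the Azure data and gradient-boosted trees in place of the Two-Class Boosted Decision Tree:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

    # impurity-based importances baked into the trained model
    print(sorted(zip(clf.feature_importances_, X.columns), reverse=True)[:5])

    # permutation importance on held-out data, analogous to the Azure module
    result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
    print(sorted(zip(result.importances_mean, X.columns), reverse=True)[:5])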

How to train a classifier on several feature types together (string, numeric, categorical, timestamp, etc.)?

I am a newbie in the field of machine learning. I have taken Udacity's "Introduction to Machine Learning" course, so I know how to run basic classifiers using sklearn and Python. But all the classifiers taught in the course were trained on a single data type.
I have a problem wherein I want to classify a code commit as "clean" or "buggy".
I have a feature set that contains string data (like the name of a person), categorical data (say "clean" vs "buggy"), numeric data (like the number of commits) and timestamp data (like the time of a commit). How can I train a classifier on these features simultaneously? Let's assume I plan on using a Naive Bayes classifier and sklearn. Please help!
I am trying to implement the paper. Any help would be really appreciated.
Many machine learning classifiers, like logistic regression, random forest, decision trees and SVM, work fine with both continuous and categorical features. My guess is that you have two paths to follow. The first one is data pre-processing: for example, convert all string/categorical data (like the name of a person) to integers. The second is ensemble learning: combine different classifiers (each one dealing with one kind of heterogeneous feature) using, for example, a majority vote, so they can reach a consensus classification. Hope it helps.
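A minimal sketch of the pre-processing path, with hypothetical commit data standing in for the real feature set (the RandomForestClassifier is just a stand-in; the same pipeline works with other estimators):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # hypothetical commit data; "hour" is derived from the raw timestamp
    df = pd.DataFrame({
        "author":    ["alice", "bob", "alice", "carol"],
        "n_commits": [12, 3, 45, 7],
        "hour":      [9, 23, 14, 2],
        "label":     ["clean", "buggy", "clean", "buggy"],
    })

    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["author"]),  # strings/categoricals
        ("num", StandardScaler(), ["n_commits", "hour"]),             # numeric features
    ])

    clf = Pipeline([("pre", pre), ("model", RandomForestClassifier(random_state=0))])
    clf.fit(df.drop(columns="label"), df["label"])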

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set a weight for each feature when "training" the KNN learner, or
learn what the optimal weight values are, with or without pre-processing the data?
On a related note, I understand that KNN generally does not require training, but since sklearn implements it using KD-trees, the tree must be generated from the training data. However, this sounds like it is turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than the others", it usually means that a difference in feature two is worth, say, 10x a difference in the other coordinates. A simple way to achieve this is to multiply coordinate #2 by its weight, so you put into the tree not the original coordinates but the coordinates multiplied by their respective weights.
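A minimal sketch of that trick with sklearn, assuming X (three features) and y are already loaded; the weight values are arbitrary illustrations:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    weights = np.array([1.0, 10.0, 1.0])  # hypothetical: feature 2 counts 10x in the distance

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X * weights, y)  # build the tree on weighted coordinates

    # any query point must be scaled by the same weights:
    # knn.predict(x_new * weights)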
In case your features are combinations of the coordinates, you might need to apply an appropriate matrix transform to your coordinates before applying the weights; see PCA (principal component analysis). PCA is also likely to help you with question 2.
The answer to question 2 is called "metric learning", and it is currently not implemented in scikit-learn. Using the popular Mahalanobis distance amounts to rescaling the data using StandardScaler. Ideally you would want your metric to take the labels into account.
