Number of hypotheses from a decision tree? - decision-tree

I apologize for asking this question on here, but I am confused about the number of hypotheses that a decision tree represents. Is it one hypothesis, or is it multiple hypotheses?

No worries!
A decision tree is a decision support tool that models decisions and their consequences. Decision tree learning produces a predictive model that maps observations about an item (e.g. its price) to the item's target value (e.g. purchase or not, the price of a house).
Each tree represents a single hypothesis which classifies or regresses examples.
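As a minimal sketch of that point (assuming scikit-learn, which the answer doesn't mention; the data here is made up), fitting produces one tree, i.e. one function h from features to predictions:

from sklearn.tree import DecisionTreeClassifier

X = [[100, 3], [250, 4], [80, 2], [300, 5]]  # e.g. [price, rooms]
y = [0, 1, 0, 1]                             # e.g. purchased or not

h = DecisionTreeClassifier().fit(X, y)       # the single learned hypothesis
print(h.predict([[120, 3]]))                 # apply h to a new observation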

Related

Autoregressive Generalised Additive Models (AR GAMs) and their interpretations

Basically, I'm modelling tree fruiting patterns using mgcv bams, and autoregressive AR(1) models have much better outcomes according to itsadug::compareML(). (bam with AR(1) was chosen due to limitations associated with binomial data.) Further, this AR structure is backed up by biological theory. However, the best models when I use AR techniques often don't include terms that are included in the non-AR models. I understand this to be a common occurrence: the AR term explains much of the variance, leaving less for the remaining terms to explain.
I've seen discussions on here warning that AR GAMs should be interpreted with care, and Gavin Simpson's AR GAM post (part 1) ends by hinting that there are some serious diagnostic criteria that should be considered, but part 2 never came out, and I'm struggling to find resources on interpretation. Much more common are simple introductory articles.
I guess the fundamental question is this: the two different types of model will make different statements about the effects of a given predictor, so which should be believed?
If the non-AR model finds that month is a useful predictor, but the AR model finds it ultimately superfluous, does month have an effect? Is month relevant due to effects like light patterns, or just because of correlational structure? I guess this is a classic 'no models are true, some are useful' situation.
This problem persists even within a single predictor. My temperature:vpds tensor product spline will identify a particular region as increasing the probability in non-AR models, but will suggest another region does so in AR models (in addition to the first).
I'm presently leaning towards including both sets of models in my paper, and noting that the AR models provide better predictions while the non-AR models can provide insight into the effects of variables. Even then I wonder which is more useful: the model that best fits the data without any AR, or the non-AR version of the AR model (i.e., keep the same predictors but set the autocorrelation parameter to 0)? I'm leaning towards the former, because I feel strange about the models that have almost no predictors.

Confusion matrix for LDA

I'm trying to check the performance of my LDA model using a confusion matrix, but I have no clue what to do. I'm hoping someone can maybe just point me in the right direction.
So I ran an LDA model on a corpus filled with short documents. I then calculated the average vector of each document and then proceeded with calculating cosine similarities.
How would I now get a confusion matrix? Please note that I am very new to the world of NLP. If there is some other/better way of checking the performance of this model please let me know.
What is your model supposed to be doing? And how is it testable?
In your question you haven't described a testable assessment of the model whose results could be represented in a confusion matrix.
A confusion matrix helps you represent and explore the different types of "accuracy" of a predictive system such as a classifier. It requires your system to make a choice (e.g. yes/no, or a multi-label classification), and you must use known test data to be able to score it against how the system should have chosen. You then count these results in the matrix as one of the possible combinations; e.g. for binary choices there are two wrong and two correct outcomes.
For example, if your cosine similarities are trying to predict if a document is in the same "category" as another, and you do know the real answers, then you can score them all as to whether they were predicted correctly or wrongly.
The four possibilities for a binary choice are:
Positive prediction vs. positive actual = True Positive (correct)
Negative prediction vs. negative actual = True Negative (correct)
Positive prediction vs. negative actual = False Positive (wrong)
Negative prediction vs. positive actual = False Negative (wrong)
It's more complicated in a multi-label system as there are more combinations, but the correct/wrong outcome is similar.
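Here's an illustrative sketch of the counting, assuming (as in your setup) that you threshold cosine similarities into a binary "same category" prediction; the 0.8 threshold and the data are made up:

pairs = [(0.91, True), (0.34, False), (0.85, False), (0.12, False), (0.77, True)]

tp = fp = tn = fn = 0
for similarity, actually_same in pairs:
    predicted_same = similarity >= 0.8  # the system's binary choice
    if predicted_same and actually_same:
        tp += 1
    elif predicted_same and not actually_same:
        fp += 1
    elif not predicted_same and actually_same:
        fn += 1
    else:
        tn += 1

print([[tp, fp], [fn, tn]])  # the 2x2 confusion matrix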
About "accuracy".
There are many ways to measure how well the system performs, so it's worth reading up on this before choosing the way to score the system. The term "accuracy" means something specific in this field, and is sometimes confused with the general usage of the word.
How you would use a confusion matrix.
The confusion matrix sums (the total TP, FP, TN, FN counts) can be fed into some simple equations which give you these performance ratings (which are referred to by different names in different fields):
sensitivity, d' (dee-prime), recall, hit rate, or true positive rate (TPR)
specificity, selectivity or true negative rate (TNR)
precision or positive predictive value (PPV)
negative predictive value (NPV)
miss rate or false negative rate (FNR)
fall-out or false positive rate (FPR)
false discovery rate (FDR)
false omission rate (FOR)
Accuracy
F Score
So you can see that Accuracy is a specific thing, but it may not be what you think of when you say "accuracy"! The last two are more complex combinations of measures. The F score is perhaps the most robust of these, as it's tuneable to represent your requirements by weighting the combination of precision and recall.
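As a quick sketch, here are the standard formulas for a few of these, computed directly from the four counts (the example numbers are made up):

def metrics(tp, fp, tn, fn):
    recall = tp / (tp + fn)                     # sensitivity / hit rate / TPR
    specificity = tn / (tn + fp)                # TNR
    precision = tp / (tp + fp)                  # PPV
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # the specific meaning of "accuracy"
    f1 = 2 * precision * recall / (precision + recall)
    return recall, specificity, precision, accuracy, f1

print(metrics(tp=40, fp=10, tn=45, fn=5))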
I found this Wikipedia article most useful; it helped me understand why it is sometimes best to choose one metric over another for your application (e.g. whether missing trues is worse than missing falses). There is a group of linked articles on the same topic from different perspectives, e.g. this one about search.
This is a simpler reference I found myself returning to: http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html
This is about the sensitivity index, from a more statistical, scientific view, with links to ROC charts, which are related to confusion matrices and are also useful for visualising and assessing performance: https://en.wikipedia.org/wiki/Sensitivity_index
This article is more specific to using these in machine learning, and goes into more detail: https://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
So in summary confusion matrices are one of many tools to assess the performance of a system, but you need to define the right measure first.
Real world example
I worked through this process recently in a project where the point was to find all of a few relevant documents in a large set (using cosine distances like yours). This was like a recommendation engine driven by manual labelling rather than an initial search query.
I drew up a list of goals with a stakeholder, in their own terms from the project domain's perspective, then tried to translate or map these goals into performance metrics and statistical terms. You can see it's not just a simple choice! The hugely imbalanced nature of our data set constrained the choice of metric, as some metrics assume balanced data and will otherwise give misleading results.
Hopefully this example will help you move forward.

Updating a Decision Tree With New Data

I am new to decision trees. I am planning to build a large decision tree that I would like to update later with additional data. What is the best approach to this? Can any decision tree be later updated?
Decision trees are most often trained on all available data. That is, when you have new data, you retrain the entire tree. Since this process is very fast, it is usually not problematic. If the data is too big to fit in memory, you can often get around this by subsampling (row sampling) the training set, since tree-based models don't need that much data to give good results.
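A minimal sketch of that "update" strategy (assuming scikit-learn; the data is synthetic): just refit on the old rows plus the new ones.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_old, y_old = np.random.rand(1000, 5), np.random.randint(2, size=1000)
X_new, y_new = np.random.rand(200, 5), np.random.randint(2, size=200)

X = np.vstack([X_old, X_new])              # old + new training rows
y = np.concatenate([y_old, y_new])

tree = DecisionTreeClassifier().fit(X, y)  # full retrain, not incremental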
Note that decision trees are quite vulnerable to overfitting, and you should consider Random Forest or another ensemble method. With bagging it is possible to train different trees on different subsets of the data.
Incremental and online learning methods also exist for decision trees; examples include incremental variants of CART and ID3, as well as the VFDT (Hoeffding tree) learner.
See gaenari, a C++ incremental decision tree library. It continuously inserts new chunks of data and updates the tree, and a rebuild can update the model when accuracy decreases (concept drift).

How to see the column that influences the prediction result the most?

I'm using Azure Machine Learning Studio to predict a column using the Two-Class Boosted Decision Tree and Split Data modules.
The diagram that I have assembled can be found here:
What I'd like is to see the column in the dataset that influences the prediction the most. In other words, the column that changes the prediction result more than any other column in the dataset.
Sorry if this has been asked before, but I couldn't find a proper answer to this simple question.
As said before, Permutation Feature Importance does the trick. Attach the Permutation Feature Importance module to the trained model block, click on the output port, and select Visualize to see the module's results: a list of features sorted in descending order of their permutation importance scores.
A word of advice: be careful when interpreting permutation scores when you have highly correlated features.
For more info, see:
https://standupdata.com/category/permutation-feature-importance/
https://gallery.cortanaintelligence.com/Experiment/Permutation-Feature-Importance-5
Most ML implementations of decision trees include something called "feature importance" in the model. For example, scikit-learn's DecisionTreeClassifier has a feature_importances_ attribute that indicates the importance of each feature.
The Azure ML implementation should be no exception. Please look at the link below on Permutation Feature Importance.
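To illustrate outside Azure ML (a sketch assuming scikit-learn and one of its bundled datasets), both the built-in impurity-based importances and permutation importance take only a couple of lines:

from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

print(tree.feature_importances_)                # impurity-based importances
result = permutation_importance(tree, X, y, n_repeats=10, random_state=0)
print(result.importances_mean.argsort()[::-1])  # feature indices, most important first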

Incrementally Trainable Entity Recognition Classifier

I'm doing some semantic-web/nlp research, and I have a set of sparse records, containing a mix of numeric and non-numeric data, representing entities labeled with various features extracted from simple English sentences.
e.g.
uid|features
87w39423|speaker=432, session=43242, sentence=34, obj_called=bob,favorite_color_is=blue
4535k3l535|speaker=512, session=2384, sentence=7, obj_called=tree,isa=plant,located_on=wilson_street
23432424|speaker=997, session=8945305, sentence=32, obj_called=salty,isa=cat,eats=mice
09834502|speaker=876, session=43242, sentence=56, obj_called=the monkey,ate=the banana
928374923|speaker=876, session=43242, sentence=57, obj_called=it,was=delicious
294234234|speaker=876, session=43243, sentence=58, obj_called=the monkey,ate=the banana
sd09f8098|speaker=876, session=43243, sentence=59, obj_called=it,was=hungry
...
A single entity may appear more than once (but with a different UID each time), and may have overlapping features with its other occurrences. A second data set represents which of the above UIDs are definitely the same.
e.g.
uid|sameas
87w39423|234k2j,234l24jlsd,dsdf9887s
4535k3l535|09d8fgdg0d9,l2jk34kl,sd9f08sf
23432424|io43po5,2l3jk42,sdf90s8df
09834502|294234234,sd09f8098
...
What algorithm(s) would I use to incrementally train a classifier that could take a set of features and instantly recommend the N most similar UIDs, along with the probability that those UIDs actually represent the same entity? Optionally, I'd also like a recommendation of missing features to populate, so I could re-classify to get more certain matches.
I researched traditional approximate nearest neighbor algorithms, such as FLANN and ANN, and I don't think these would be appropriate, since they're not trainable (in a supervised learning sense), nor are they typically designed for sparse non-numeric input.
As a very naive first attempt, I was thinking about using a naive Bayesian classifier, converting each SameAs relation into a set of training samples. So, for each entity A with B SameAs relations, I would iterate over each and train the classifier like:
classifier = Classifier()
for entity, sameas_entities in sameas_dataset:
    entity_features = get_features(entity)
    for other_entity in sameas_entities:
        other_entity_features = get_features(other_entity)
        # Train symmetrically, so each pair is seen from both directions.
        classifier.train(['left_' + f for f in entity_features] +
                         ['right_' + f for f in other_entity_features],
                         cls=entity)
        classifier.train(['left_' + f for f in other_entity_features] +
                         ['right_' + f for f in entity_features],
                         cls=other_entity)
And then use it like:
>>> print classifier.findSameAs(dict(speaker=997, session=8945305, sentence=32, obj_called='salty',isa='cat',eats='mice'), n=7)
[(1.0, '23432424'), (0.999, 'io43po5'), (1.0, '2l3jk42'), (1.0, 'sdf90s8df'), (0.76, 'jerwljk'), (0.34, 'rlekwj32424'), (0.08, '09843jlk')]
>>> print classifier.findSameAs(dict(isa='cat',eats='mice'), n=7)
[(0.09, '23432424'), (0.06, 'jerwljk'), (0.03, 'rlekwj32424'), (0.001, '09843jlk')]
>>> print classifier.findMissingFeatures(dict(isa='cat',eats='mice'), n=4)
['obj_called','has_fur','has_claws','lives_at_zoo']
How viable is this approach? The initial batch training would be horribly slow, at least O(N^2), but incremental training support would allow updates to happen more quickly.
What are better approaches?
I think this is more of a clustering than a classification problem. Your entities are data points and the sameas data is a mapping of entities to clusters. In this case, clusters are the distinct 'things' your entities refer to.
You might want to take a look at semi-supervised clustering. A brief google search turned up the paper Active Semi-Supervision for Pairwise Constrained Clustering which gives pseudocode for an algorithm that is incremental/active and uses supervision in the sense that it takes training data indicating which entities are or are not in the same cluster. You could derive this easily from your sameas data, assuming that - for example - uids 87w39423 and 4535k3l535 are definitely distinct things.
However, to get this to work you need to come up with a distance metric based on the features in the data. You have a lot of options here; for example, you could use a simple Hamming distance on the features, but the choice of metric function is a little bit arbitrary. I'm not aware of any good ways of choosing the metric, but perhaps you have already looked into this when you were considering nearest-neighbour algorithms.
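For concreteness, here is one illustrative (and entirely assumed) metric choice: a Hamming-style distance over sparse feature dicts, counting the fraction of keys whose values disagree or are missing on one side.

def feature_distance(a, b):
    # Fraction of keys (across both records) whose values differ.
    keys = set(a) | set(b)
    return sum(a.get(k) != b.get(k) for k in keys) / len(keys)

d = feature_distance(dict(isa='cat', eats='mice'),
                     dict(isa='cat', eats='fish', has_fur='yes'))
print(d)  # 2 of 3 keys disagree -> 0.666...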
You can come up with confidence scores using the distances from the cluster centres. If you want an actual probability of membership, then you would want to use a probabilistic clustering model, like a Gaussian mixture model. There's quite a lot of software for Gaussian mixture modelling, though I don't know of any that is semi-supervised or incremental.
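A minimal sketch of the probabilistic-membership idea (using scikit-learn's GaussianMixture; note it is neither semi-supervised nor incremental, and the entity vectors here are random placeholders):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 8)  # hypothetical numeric encoding of entities

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict_proba(X[:1]))  # soft membership probabilities per cluster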
There may be other suitable approaches if the question you wanted to answer was something like "given an entity, which other entities are likely to refer to the same thing?", but I don't think that is what you are after.
You may want to take a look at this method:
"Large Scale Online Learning of Image Similarity Through Ranking" Gal Chechik, Varun Sharma, Uri Shalit and Samy Bengio, Journal of Machine Learning Research (2010).
[PDF] [Project homepage]
More thoughts:
What do you mean by 'entity'? Is the entity the thing referred to by 'obj_called'? Do you use the content of 'obj_called' to match different entities, e.g. is 'John' similar to 'John Doe'? Do you use proximity between sentences to indicate similar entities? What is the greater goal (task) of the mapping?
