how to explain this decision tree interpretability question? - decision-tree

[Figures 5 and 6: two decision trees, one shallow and one very deep; images not shown.]
The two pictures above show the two decision trees.
The question is: It is often claimed that a strength of decision trees is their interpretability.
Is this always justified? Refer to Figures 5 and 6 to help with your answer.

I think the point of the question is that a decision tree is only interpretable if its depth is relatively small. The second tree is very deep: a single prediction passes through a large number of splitting decisions. You therefore lose interpretability, because the explanation for any prediction becomes a conjunction of too many conditions for a human user to process.
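A quick way to see this for yourself (a minimal sketch, assuming scikit-learn; the dataset is just a stand-in for whatever is behind Figures 5 and 6): the depth-limited tree prints a handful of readable rules, while the unconstrained tree grows far deeper and its rule list stops being something a person can follow.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X, y = data.data, data.target

shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(random_state=0).fit(X, y)  # grows until leaves are pure

# A depth-2 tree is a few if/else rules you can read aloud...
print(export_text(shallow, feature_names=list(data.feature_names)))
# ...while the unconstrained tree needs dozens of conditions per prediction.
print("deep tree depth:", deep.get_depth(), "leaves:", deep.get_n_leaves())
```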

Related

What process does happen in 'Input Transform' in PointNet architecture?

I am reading papers to understand methods that convert raw point cloud data into a machine-learning-ready dataset, and I have a question about the paper PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In the PointNet architecture (shown in the picture below), the raw point cloud first goes into the 'Input transform' part, where some processing happens in a T-Net (transformation network) followed by a matrix multiplication. My question is: what happens in the 'Input transform' and 'Feature transform' parts? What is the input data and what is the output data? An explanation of this is my main question.
You can find the research paper by the doi: 10.1109/CVPR.2017.16
I'm trying to work this out as well, so consider this an incomplete answer. I think the input transform, with its 3x3 matrix, acts to spatially transform (via some affine transformation) the n x 3 inputs (3-dimensional, think x, y, z). Intuitively, you may think of it this way: if you give it a rotated object (say an upside-down chair), it would de-rotate the object into a canonical representation (an upright chair). It is a 3x3 matrix so as to preserve the dimensionality of the input. That way the input becomes invariant to changes of pose (perspective). After this, the shared MLPs (essentially 1x1 convolutions) increase the number of features from n x 3 to n x 64, and the next T-Net does the same as the first one: it moves the higher-dimensional feature space into a canonical form. As to exactly how the box works, I'm reading the code and will let you know.
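Since I'm still reading the code myself, here is only a rough PyTorch sketch of what the input-transform T-Net looks like as I understand it (layer sizes follow the paper, but treat the details as assumptions, not the authors' exact implementation): the network regresses a 3x3 matrix from the points, initialises it near the identity, and the points are then matrix-multiplied by it.

```python
import torch
import torch.nn as nn

class TNet(nn.Module):
    """Regresses a k x k alignment matrix from a set of n points/features."""
    def __init__(self, k=3):
        super().__init__()
        self.k = k
        # shared MLPs = 1x1 convolutions applied independently to every point
        self.mlp = nn.Sequential(
            nn.Conv1d(k, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, k * k),
        )

    def forward(self, x):                              # x: (batch, k, n_points)
        feat = torch.max(self.mlp(x), dim=2).values    # symmetric max-pool over points
        m = self.fc(feat).view(-1, self.k, self.k)
        return m + torch.eye(self.k, device=x.device)  # start near "no transform"

# input transform: raw points in, canonically aligned points out (same shape)
points = torch.randn(8, 1024, 3)                  # batch of 8 clouds, 1024 points each
transform = TNet(k=3)(points.transpose(1, 2))     # predicted (8, 3, 3) matrices
aligned = torch.bmm(points, transform)            # the "matrix multiply" box: (8, 1024, 3)
```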

What is the difference between growing a tree based learning algorithm vertically as compared to horizontally?

I came across the tree-based algorithm LightGBM and have read that it grows trees vertically, meaning that LightGBM grows trees leaf-wise (while some other algorithms grow level-wise). I was wondering: what is the advantage of growing a tree vertically? Are there any?
A difference (not necessarily an advantage) which I can see is the way you need to define early-stopping criteria while growing the tree. Any thoughts on this?
As described in this section of LightGBM's documentation:
LightGBM uses leaf-wise (or what XGBoost calls lossguide) tree growth because it can achieve lower loss (i.e. a better fit to the training data) than depth-wise tree growth, holding the number of leaves constant.
In leaf-wise tree growth, the split with the largest gain is chosen, regardless of its level of depth.
A difference ... I can see is the way you need to define early-stopping criteria while growing the tree
It's true that in this type of tree growth, you now have to consider two closely-related ways to prevent overfitting:
maximum depth (max_depth in LightGBM)
total allowed number of leaves (num_leaves in LightGBM)
I'm assuming this is what you meant by "early-stopping criteria", but wanted to also note that the phrase "early stopping" has a special meaning in GBMs that isn't related to how individual trees are grown. Early stopping, as XGBoost, LightGBM, and other GBM libraries refer to it, means "if performance on held-out data fails to improve for n iterations, stop training".
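To make the two knobs, and the separate meaning of "early stopping", concrete, here is a minimal sketch on synthetic data. It assumes a reasonably recent LightGBM where early stopping is supplied as a callback (older versions used an early_stopping_rounds argument on fit instead).

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# leaf-wise growth: num_leaves is the main complexity control,
# max_depth (-1 = unlimited) is just a backstop against very deep trees
model = lgb.LGBMClassifier(num_leaves=31, max_depth=-1, n_estimators=1000)

# "early stopping" in the GBM sense: stop adding *trees* once the
# validation metric hasn't improved for 50 rounds
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print(model.best_iteration_)
```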

Simple Binary Text Classification

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
The data: title & abstract (mean = 1,300 characters).
Any approach may be used or even combined, including supervised machine learning and/or establishing features that give rise to threshold values for inclusion, among others.
Approaches could draw on the key terms that describe the conceptual space, though a simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, etc.
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1 = relevant, 0 = irrelevant). Would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get the document-topic and topic-word distributions, say with 20 topics (depending on your dataset's coverage of different topics). Assign the top r% of documents with the highest weight on the relevant topics as relevant and the bottom nr% as non-relevant, then train a classifier over those labelled documents.
Just use bag of words and retrieve the top r% nearest neighbours to your query (your conceptual space) as relevant and the bottom nr% as non-relevant, then train a classifier over them (see the sketch after this list).
If you had the citations, you could run label propagation over the citation graph after labelling very few papers.
Don't forget to distinguish title words from abstract words, e.g. by rewriting a title word as title_word1, so that a classifier can put more weight on them.
Cluster the articles into, say, 100 clusters and then manually label those clusters. Choose the number of clusters based on the coverage of different topics in your corpus. You could also use hierarchical clustering for this.
If the number of relevant documents is far smaller than the number of non-relevant ones, the best way to go is to find the nearest neighbours to your conceptual space (e.g. using the information retrieval implemented in Lucene). Then you can manually go down the ranked results until you feel the documents are no longer relevant.
Most of these methods are bootstrapping or weakly supervised approaches to text classification, about which you can find more literature.
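To make the second idea above concrete, here is a minimal scikit-learn sketch: TF-IDF bag of words, cosine similarity to a hand-written concept query, the top r% treated as pseudo-relevant and the bottom nr% as pseudo-irrelevant, then a classifier trained on those weak labels. The docs list, the query, and the thresholds are placeholders you would replace and tune.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

docs = ["title and abstract of article 1 ...", "title and abstract of article 2 ..."]  # 800k+ in practice
concept_query = "workplace learning on-the-job training professional development"      # hypothetical key terms

vec = TfidfVectorizer(stop_words="english", max_features=50000)
X = vec.fit_transform(docs)
sims = cosine_similarity(X, vec.transform([concept_query])).ravel()

order = np.argsort(sims)
r, nr = 0.05, 0.20                                 # tune on a small labelled sample
pos = order[-max(1, int(r * len(docs))):]          # most similar -> pseudo-relevant
neg = order[:max(1, int(nr * len(docs)))]          # least similar -> pseudo-irrelevant

clf = LogisticRegression(max_iter=1000).fit(
    X[np.concatenate([pos, neg])],
    np.concatenate([np.ones(len(pos)), np.zeros(len(neg))]),
)
relevance = clf.predict_proba(X)[:, 1]             # relevance scores for the whole corpus
```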

Supervised Learning for User Behavior over Time

I want to use machine learning to identify the signature of a user who converts to a subscriber of a website given their behavior over time.
Let's say my website has 6 different features which can be used before subscribing and users can convert to a subscriber at any time.
For a given user I have stats that represent the intensity, on a continuous scale, of that user's interaction with features 1-6 on a daily basis, so:
D1: f1,f2,f3,f4,f5,f6
D2: f1,f2,f3,f4,f5,f6
D3: f1,f2,f3,f4,f5,f6
D4: f1,f2,f3,f4,f5,f6
Let's say on day 5, the user converts.
What machine learning algorithms would help me identify the most common patterns in feature usage that lead to a conversion?
(I know this is a super basic classification question, but I couldn't find a good example using longitudinal data, where input vectors are ordered by time like I have)
To develop the problem further, let's assume that each feature has 3 intensities at which the user can interact (H, M, L).
We can then represent each user as a string of states of interaction intensity. So, for a user:
LLLLMM LLMMHH LLHHHH
This would mean that on day one they only interacted significantly with features 5 and 6, but by the third day they were interacting heavily with features 3 through 6.
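For what it's worth, producing those daily "words" from the continuous intensities is only a few lines; the 0.33/0.66 cut-offs below are arbitrary placeholders for wherever the L/M/H boundaries are drawn.

```python
import numpy as np

# hypothetical intensities for one user: one row per day, columns = features 1-6
daily = np.array([
    [0.1, 0.0, 0.2, 0.1, 0.5, 0.6],
    [0.1, 0.2, 0.4, 0.5, 0.8, 0.9],
    [0.0, 0.1, 0.7, 0.9, 0.9, 0.8],
])

def to_state_word(row, low=0.33, high=0.66):
    """Discretise one day's six intensities into a 6-letter L/M/H word."""
    return "".join("H" if v >= high else "M" if v >= low else "L" for v in row)

sentence = " ".join(to_state_word(day) for day in daily)
print(sentence)   # -> "LLLLMM LLMMHH LLHHHH"
```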
N-gram Style
I could make these states words and a user's lifetime a sentence. (I would probably need to add a "conversion" word to the vocabulary as well.)
If I ran these "sentences" through an n-gram model, I could get the likely future state of a user given his/her past few states, which is somewhat interesting. But what I really want to know is the most common sets of n-grams that lead to the conversion word. Rather than feeding in an n-gram and getting the next predicted word, I want to give the predicted word and get back the 10 most common n-grams (from my data) that are likely to lead to that word.
Amaç Herdağdelen suggests identifying n-grams up to a practical n and then counting how many of each n-gram state each user has, then correlating that with conversion data (I guess there is no conversion word in this version). My concern is that there would be too many n-grams to make this method practical. (If each state has 729 possibilities and we're using trigrams, that's a lot of possible trigrams!)
Alternatively, could I just go through the data logging the n-grams that led to the conversion word and then run some type of clustering on them to see what the common paths to a conversion are?
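Logging what immediately precedes the conversion word at least seems cheap to try before any modelling, e.g. counting the n states right before conversion across all converting users (the sequences below are made up):

```python
from collections import Counter

# hypothetical per-user sentences: one state word per day, ending in CONV for converters
users = [
    "LLLLMM LLMMHH LLHHHH CONV",
    "LLLLLL LLLLMM LLLLMM",
    "LLLLML LLMMHH LLHHHH CONV",
]

n = 2                                   # look at the 2 states right before conversion
pre_conversion = Counter()
for sentence in users:
    words = sentence.split()
    if words[-1] == "CONV" and len(words) > n:
        pre_conversion[tuple(words[-1 - n:-1])] += 1

print(pre_conversion.most_common(10))   # most common paths into a conversion
```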
Survival Style
Suggested by Iterator. I understand the analogy to a survival problem, but the literature here seems to focus on predicting time to death as opposed to the common sequence of events that leads to death. Further, when looking up the Cox proportional hazards model, I found that it does not even accommodate variables that change over time (it's good for differentiating between static attributes like gender and ethnicity), so it seems geared toward a rather different question than mine.
Decision Tree Style
This seems promising, though I can't completely wrap my mind around how to structure the data. Since the data is not flat, is the tree modeling the chance of moving from one state to another down the line, and whether that leads to conversion or not? This is very different from the decision tree literature I've been able to find.
I also need clarity on how to identify the patterns that lead to conversion, rather than a model that predicts the likelihood of conversion after a given sequence (see the sketch below for one way to flatten the data).
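One thing I could try, I suppose, is to flatten a fixed window of the last k days into a single row per user, so that an off-the-shelf decision tree applies and its printed rules read directly as usage patterns. A sketch with random placeholder data and hypothetical column names, assuming scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

k = 3                                          # last k days before the label
rng = np.random.default_rng(0)
X = rng.random((200, k * 6))                   # 200 users, k days x 6 feature intensities
y = rng.integers(0, 2, size=200)               # 1 = converted, 0 = not (placeholder labels)

names = [f"day{d}_f{f}" for d in range(1, k + 1) for f in range(1, 7)]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# each printed path is a pattern like "day3_f5 high AND day2_f6 high -> converted"
print(export_text(tree, feature_names=names))
```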
Theoretically, hidden Markov models may be a suitable solution to your problem. The features on your site would constitute the alphabet, and you could use the sequences of interactions as positive or negative instances depending on whether the user finally subscribed or not. I don't have a guess about what the number of hidden states should be, but finding a suitable value for that parameter is part of the problem, after all.
As a side note, positive instances are trivial to identify, but the fact that a user has not subscribed so far doesn't necessarily mean s/he won't. You might consider limiting your data to sufficiently old users.
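As a sketch of what this could look like in code, assuming the hmmlearn package and its CategoricalHMM class: encode each day's state as an integer symbol, fit one HMM on converters' sequences (and another, not shown, on non-converters'), then classify a new user by comparing log-likelihoods under the two models. The data and the number of hidden states below are placeholders.

```python
import numpy as np
from hmmlearn import hmm   # assumption: a recent hmmlearn with CategoricalHMM

def encode(word):
    """Map a 6-letter L/M/H state word to an integer symbol (3**6 = 729 symbols)."""
    return sum({"L": 0, "M": 1, "H": 2}[c] * 3 ** i for i, c in enumerate(word))

# hypothetical sequences from users who converted
converters = [["LLLLMM", "LLMMHH", "LLHHHH"], ["LLLLML", "LLMMHH", "LLHHHH"]]
seqs = [np.array([[encode(w)] for w in s]) for s in converters]
X, lengths = np.concatenate(seqs), [len(s) for s in seqs]

model_pos = hmm.CategoricalHMM(n_components=3, n_iter=50, random_state=0)
model_pos.fit(X, lengths)

# higher score under model_pos than under a model fit on non-converters => likely converter
new_user = np.array([[encode(w)] for w in ["LLLLMM", "LLMMHH"]])
print(model_pos.score(new_user))
```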
I would also consider converting the data to fixed-length vectors and applying conceptually simpler models that could give you some intuition about what's going on. You could use n-grams (consecutive interaction sequences of length n).
As an example, assuming the interaction sequence of a given user is "f1,f3,f5", "f1,f3,f5" would constitute a 3-gram (trigram). Similarly, for the same user and the same interaction sequence you would have "f1,f3" and "f3,f5" as the 2-grams (bigrams). To represent each user as a vector, you would identify all n-grams up to a practical n and count how many times the user employed each n-gram. Each column in the vector would represent the number of times a given n-gram is observed for a given user.
Then -- probably with the help of some suitable normalization technique such as pointwise mutual information or tf-idf -- you could look at the correlation between the n-grams and the final outcome to get a sense of what's going on, carry out feature selection to find the most prominent sequences that users are involved in, or apply classification methods such as nearest neighbours, support vector machines, or naive Bayes to build a predictive model.
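A minimal version of this n-gram-vector idea, using the question's state-word "sentences", scikit-learn's CountVectorizer for the 1- and 2-grams, tf-idf as the normalization, and logistic regression standing in for the classifiers mentioned (all the sequences and labels below are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

sentences = [                      # one "sentence" per user, one state word per day
    "LLLLMM LLMMHH LLHHHH",        # converted
    "LLLLLL LLLLMM LLLLMM",        # did not convert
    "LLLLML LLMMHH LLHHHH",        # converted
    "LLLLLL LLLLLL LLLLML",        # did not convert
]
y = [1, 0, 1, 0]

vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+", lowercase=False)
X = TfidfTransformer().fit_transform(vec.fit_transform(sentences))

clf = LogisticRegression(max_iter=1000).fit(X, y)

# which 1-/2-grams of daily states push hardest towards conversion?
weights = sorted(zip(vec.get_feature_names_out(), clf.coef_[0]), key=lambda t: -t[1])
print(weights[:5])
```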
This is rather like a survival analysis problem: over time the user will convert, may drop out of the population, or will continue to appear in the data without (yet) falling into either camp. For that, you may find the Cox proportional hazards model useful.
If you wish to pursue things from a different angle, namely one more from the graphical-models perspective, then a Kalman filter may be more appealing. It is a generalization of the HMMs suggested by @AmaçHerdağdelen that works for continuous spaces.
For ease of implementation, I'd recommend the survival approach. It is the easiest to analyze, describe, and improve. After you have a firm handle on the data, feel free to drop in other methods.
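If you do start with the survival approach, a minimal sketch with the lifelines package looks like the following; the per-user feature summaries, observation times, and event flags are all placeholders.

```python
import pandas as pd
from lifelines import CoxPHFitter   # assumption: the lifelines package

# one row per user: summary of feature usage, how long we observed them,
# and whether the event (conversion) happened before observation ended
df = pd.DataFrame({
    "f1_mean": [0.1, 0.4, 0.2, 0.7, 0.3, 0.5, 0.6, 0.2],
    "f5_mean": [0.6, 0.7, 0.8, 0.2, 0.5, 0.7, 0.4, 0.3],
    "days_observed": [5, 30, 9, 4, 21, 7, 11, 18],
    "converted": [1, 0, 1, 1, 0, 1, 0, 1],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="days_observed", event_col="converted")
cph.print_summary()   # hazard ratios: which features are associated with converting sooner
```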
Other than Markov chains, I would suggest decision trees or Bayesian networks. Both of these would give you the likelihood of a user converting after a given sequence.
I forgot to mention this earlier: you may also want to take a look at the Google PageRank algorithm. It would help you account for the user disappearing completely [not subscribing]. The results would help you decide which features to encourage users to use [because they're more likely to lead to a sale].
I think the n-gram approach is the most promising, because most sequence models in data mining (HMM, CRF, ACRF, Markov random fields) treat each element as depending on only a few previous steps. So I would try a classifier based on 1-grams and 2-grams.

2D shape optimization through genetic algorithms

I just recently started learning about genetic algorithms and am now trying to apply them to 2D shape optimization in a physics simulation. The simulation produces a single scalar for each shape. (I guess this is kind of similar to boxcar2d http://boxcar2d.com/)
The 2D shapes are actually the union of several 2D "sub-shapes". Each sub-shape is stored as a list of angles/radii, and the 2D shape is then stored as a list of sub-shape lists. This serves as my chromosome right now.
Right now, for fitness, I will probably use the scalar the simulation produces. My question is: how should I go about the selection and reproduction process? Would tournament selection be more appropriate, or would I want to use truncation in combination with proportional selection? Also, how do you find a good mutation rate, population size, etc.?
Sorry for so many questions, but thanks in advance. I just don't really know where to start.
In my view the best way is to use an adaptive reproduction strategy during evolution: in the first steps (call it "the first phase" of the run) you might set a high mutation probability; in this phase you should find a reasonably good solution. In the "second phase" of the algorithm you might decrease the mutation probability every few steps; in this phase you should refine your solution. But sometimes in my practice I've noticed degradation of the population during the second phase of optimization (when every chromosome is strongly similar to the others), which results in extremely slow optimization progress, so my solution was to add occasional large random mutation perturbations, and it helped.
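A bare-bones sketch of that kind of loop: tournament selection, a little elitism, and a mutation rate that decays from a high "first phase" value to a low "second phase" floor. The shape encoding, crossover, mutation, and fitness below are toy placeholders; in your case fitness would be the scalar returned by the physics simulation.

```python
import random

def evolve(fitness, random_shape, mutate, crossover,
           pop_size=60, generations=200, tournament_k=3):
    population = [random_shape() for _ in range(pop_size)]
    for gen in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda t: t[0], reverse=True)
        # "first phase": ~0.45 mutation probability, decaying towards a 0.05 floor
        mutation_rate = 0.4 * (1 - gen / generations) + 0.05
        children = [ind for _, ind in scored[:2]]          # elitism: keep the 2 best shapes

        def pick():                                        # tournament selection
            return max(random.sample(scored, tournament_k), key=lambda t: t[0])[1]

        while len(children) < pop_size:
            child = crossover(pick(), pick())
            if random.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = children
    return max(population, key=fitness)

# toy usage: a "shape" is just a list of radii, fitness is their sum
best = evolve(fitness=sum,
              random_shape=lambda: [random.random() for _ in range(8)],
              mutate=lambda s: [min(1.0, max(0.0, r + random.uniform(-0.1, 0.1))) for r in s],
              crossover=lambda a, b: [random.choice(p) for p in zip(a, b)])
print(sum(best))
```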
I'd also advise you to read about the differential evolution algorithm: http://en.wikipedia.org/wiki/Differential_evolution. In my experience its performance is much faster than a plain genetic algorithm's.
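If you want to try differential evolution without implementing it yourself, SciPy ships one; a small sketch (the simulate function is a stand-in for your physics simulation, and SciPy minimizes, hence the sign flip):

```python
import numpy as np
from scipy.optimize import differential_evolution

def simulate(radii):
    """Placeholder for the physics simulation's scalar score (higher = better)."""
    return -np.var(radii)

bounds = [(0.1, 1.0)] * 16                      # e.g. 16 radii describing one sub-shape
result = differential_evolution(lambda x: -simulate(x), bounds, maxiter=200, seed=0)
print(result.x, -result.fun)
```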
