How to interpret a node in decision tree with a categorical variable of more than 2 levels? - decision-tree

I am applying a binary decision tree to a dataset containing both categorical and continuous variables. One of the categorical variables is vehicle type (VT), which I encoded into five levels (VT = 0, 1, 2, 3, 4). In one of the tree nodes I get VT <= 3.5. How should I interpret this split? Does it mean VT belongs to {0, 1, 2, 3}?
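A quick way to see what such a node does is to evaluate the comparison itself: assuming VT is stored as the integer codes 0-4, a numeric threshold of 3.5 sends codes 0-3 down the left branch and code 4 down the right. A toy check (values are illustrative, not the asker's data):

```python
import numpy as np

# Illustrative only: the five integer codes assumed for VT.
vt = np.array([0, 1, 2, 3, 4])

# The split "VT <= 3.5" is a plain numeric comparison on the encoded column.
goes_left = vt <= 3.5
print(dict(zip(vt.tolist(), goes_left.tolist())))
# {0: True, 1: True, 2: True, 3: True, 4: False}
```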

Related

How to do clustering on a set of paraboloids and on a set of planes?

I am performing a cluster analysis in two parts: in part (a) I cluster a set of paraboloids, and in part (b) a set of planes. The parts are separate, but in both cases I started from one set of images; on every image I detected points to which I fit (a) a paraboloid and (b) a plane. I obtained the equations of the surfaces, so I now have two sets of data: for (a) an array of arrays of size 6 (the 6 coefficients of the paraboloid equation) and for (b) an array of arrays of size 3 (the 3 coefficients of the plane equation).
I want to cluster both groups based on the similarities of (a) paraboloids and (b) planes. I am not sure which features of the surfaces (paraboloids and planes) are suitable for clustering.
For (b) I have tried using the angle between the fitted plane and the plane z = 0 -- so only 1 feature for every object in the sample.
I have also tried simply treating these 3 (or 6) coefficients as separate variables, but I believe that this way I am not using the fact that the coefficients are connected with each other.
I would be really grateful to hear whether there is a better choice of features than merely the set of coefficients. I am performing hierarchical agglomerative clustering.
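For the plane case, here is a minimal sketch of the one-angle feature fed into agglomerative clustering. It assumes each plane was fitted as z = a*x + b*y + c, so each stored coefficient triple is [a, b, c]; that parameterization and the numbers are illustrative, not taken from the question.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical fitted planes, one row per image, parameterized as z = a*x + b*y + c.
planes = np.array([[0.1,  0.2,  5.0],
                   [0.1,  0.3, -2.0],
                   [2.0, -1.5,  0.0],
                   [1.9, -1.4,  3.0]])

# Angle between each fitted plane and the plane z = 0, computed from the
# normal vectors (a, b, -1) and (0, 0, 1).
normals = np.column_stack([planes[:, 0], planes[:, 1], -np.ones(len(planes))])
cos_angle = np.abs(normals[:, 2]) / np.linalg.norm(normals, axis=1)
angles = np.degrees(np.arccos(cos_angle))           # one feature per plane

labels = AgglomerativeClustering(n_clusters=2).fit_predict(angles.reshape(-1, 1))
print(angles.round(1), labels)
```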

Transforms of NLP dependency trees into binary trees?

Spacy (and Core NLP, and other parsers) output dependency trees whose nodes can have varying numbers of children. In spacy, for example, each node has .lefts and .rights relations (multiple left branches and multiple right branches).
Pattern-matching algorithms are considerably simpler (and more efficient) when they work over predicate trees whose nodes have a fixed arity.
Is there any standard transformation from these multi-way trees into binary trees?
For example, in this example we have "publish" with two .lefts=[just, journal] and one .right=[piece]. Can such sentences (generally) be transformed into a strict binary tree notation (where each node has 0 or 1 left and 0 or 1 right branch) without much loss of information, or are multi-way trees essential to correctly carrying the information?
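A quick way to see the variable arity the question describes is to iterate over the children directly. A minimal sketch; it assumes the en_core_web_sm model is installed, and the sentence is only illustrative:

```python
import spacy

# Assumes the small English model has been downloaded:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The journal just published a piece on parsing.")  # illustrative sentence

# .lefts and .rights are generators over a token's left/right children;
# either side may yield zero, one, or many tokens.
for token in doc:
    print(token.text, [t.text for t in token.lefts], [t.text for t in token.rights])
```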
There are different types of trees in language analysis: immediate constituent trees and dependency trees (though you wouldn't normally talk of trees in dependency grammar). The former are usually binary (though there is no real reason why they have to be), as each category gets split into two subcategories, e.g.
S -> NP VP
NP -> det N1
N1 -> adj N1 | noun
Dependencies are not normally binary in nature, so there is no simple way to transform them into binary structures. The only fixed convention is that each word will be dependent on exactly one other word, but it might itself have multiple words depending on it.
So, the answer is basically "no".

scikit-learn Decision trees Regression: retrieve all samples for leaf (not mean)

I have started using scikit-learn decision trees and so far it is working out quite well, but one thing I need to do is retrieve the set of sample Y values for a leaf node, especially when running a prediction. That is, given an input feature vector X, I want to know the set of corresponding Y values at the leaf node instead of just the regression value, which is the mean (or median) of those values. Of course one would want the sample mean to have a small variance, but I do want to extract the actual set of Y values and do some statistics / create a PDF. I have used code like in "how to extract the decision rules from scikit-learn decision-tree?" to print the decision tree, but the output of 'value' is the single float representing the mean. I have a large dataset, so I limit the leaf size to e.g. 100; I want to access those 100 values...
Another solution is to use an (undocumented?) attribute of the sklearn DecisionTreeRegressor object, .tree_.impurity.
It returns the node impurity, which under the default squared-error criterion is the variance of the values in each leaf.
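To get at the raw Y values themselves, one option is the estimator's apply() method, which maps samples to leaf indices. A minimal sketch on synthetic data (the data and parameters are placeholders, not the asker's set-up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 4))
y_train = X_train[:, 0] ** 2 + rng.normal(scale=0.1, size=5000)

reg = DecisionTreeRegressor(min_samples_leaf=100).fit(X_train, y_train)

train_leaves = reg.apply(X_train)              # leaf id for every training sample
query_leaf = reg.apply(X_train[:1])[0]         # leaf id for one query point

leaf_y = y_train[train_leaves == query_leaf]   # all Y values in that leaf
print(len(leaf_y), leaf_y.mean(), leaf_y.std())  # build whatever statistics/PDF you need
```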

Multicollinearity of categorical variables

What are the different measures available to check for multicollinearity if the data contains both categorical and continuous independent variables?
Can I use VIF by converting categorical variables into dummy variables? Is there a fundamental flaw in this since I could not locate any reference material on the internet?
Can I use VIF by converting categorical variables into dummy variables?
Yes, you can. There is no fundamental flaw in this approach.
if the data contains both categorical and continuous independent variables?
Multicollinearity doesn't care whether a variable is categorical or numeric. There is nothing special about categorical variables. Convert your categorical variables into binary indicators and treat them like all other variables.
I assume your concern is that dummy variables derived from the same categorical variable must be correlated with each other, and it's a valid concern. Consider the case where the proportion of cases in the reference category is small. Say a categorical variable has three levels: overweight, normal, underweight. We can encode this with two indicator variables. If one category is very small (say, 5 out of 100 people are normal and the other 95 are underweight or overweight), the indicator variables will necessarily have high VIFs, even if the categorical variable is not associated with other variables in the regression model.
What are the different measures available to check for multicollinearity
One way to detect multicollinearity is to compute the correlation matrix of your predictors and check its eigenvalues.
Eigenvalues close to 0 indicate that the predictors are strongly correlated.
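A minimal sketch of both checks on synthetic data (the variable names, sample size, and category proportions are made up for illustration); the VIFs are read off the diagonal of the inverse correlation matrix:

```python
import numpy as np
import pandas as pd

# One continuous predictor and a dummy-coded 3-level categorical whose
# reference level ("normal") is deliberately rare.
rng = np.random.default_rng(0)
x_cont = rng.normal(size=200)
cat = rng.choice(["overweight", "normal", "underweight"], size=200, p=[0.48, 0.04, 0.48])

dummies = pd.get_dummies(pd.Series(cat, name="weight"), drop_first=True).astype(float)
X = pd.concat([pd.Series(x_cont, name="x_cont"), dummies], axis=1)

corr = X.corr().to_numpy()

# Check 1: eigenvalues of the correlation matrix; values near 0 flag collinearity.
print(np.sort(np.linalg.eigvalsh(corr)))

# Check 2: VIF for each column, from the diagonal of the inverse correlation matrix.
print(pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns, name="VIF"))
```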

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represent every possible trade in the market, the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from -50 to 50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature-importance ranking (or significant variables, as phrased in the OP):
- It gives you a visual representation of the entire classification/regression analysis (in the form of a binary tree), which distinguishes it from any other analytical/statistical technique that I am aware of.
- Decision tree algorithms require very little pre-processing of your data: no normalization, no rescaling, no conversion of discrete variables into integers (e.g., Male/Female => 0/1); they accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix).
- Again, the tree itself is a visual summary of feature-importance ranking (i.e., significant variables): the most significant variable is the root node, which is more significant than its two child nodes, which in turn are more significant than their four combined children. "Significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable', or the thing you are trying to predict). One proviso: from a visual inspection of a decision tree you cannot distinguish variable significance among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
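As a concrete illustration of that importance-by-depth reading, here is a hedged sketch with scikit-learn on synthetic data standing in for the trade table (the array shapes, variable names, and the use of sklearn rather than RapidMiner are all assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in: ten market measurements as predictors, profit/loss as response.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
profit = 3.0 * X[:, 2] - 2.0 * X[:, 7] + rng.normal(size=1000)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, profit)

# Variables split on near the root receive the highest importance scores.
ranking = np.argsort(tree.feature_importances_)[::-1]
print(ranking)
print(export_text(tree, feature_names=[f"var_{i}" for i in range(10)]))
```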
One suggestion: decision trees handle both categorical and continuous data usually without problem; however, in my experience, decision tree algorithms always perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case I would consider discretizing it (unless doing so just causes the entire analysis to be meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain--e.g., if your response variable is comprised of continuous values from 1 to 100, you might sensibly bin them into 5 bins: 0-20, 21-40, 41-60, and so on. A one-line version of that binning step is sketched below.
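The sketch below uses pandas; the edges and labels mirror the 1-to-100 example above and are otherwise arbitrary:

```python
import numpy as np
import pandas as pd

# Fake response values from 1 to 100, cut into the five bins described above.
profit = pd.Series(np.random.default_rng(0).integers(1, 101, size=20))
binned = pd.cut(profit, bins=[0, 20, 40, 60, 80, 100],
                labels=["0-20", "21-40", "41-60", "61-80", "81-100"])
print(binned.value_counts().sort_index())
```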
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable with the third value (25) results in two nearly pure subsets--one low-value and one high-value. As long as this purity were higher than for the sub-sets obtained from splitting on the other values, the data would be split on that variable/value pair.
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., from YouTube, here and here). (Note: I have not used the decision tree module in R/M, nor have I used RapidMiner at all.)
The other set of techniques I would consider is usually grouped under the rubric Dimension Reduction; Feature Extraction and Feature Selection are perhaps the two most common terms after D/R. The most widely used technique is PCA, or principal component analysis, which is based on an eigenvector decomposition of the covariance matrix (derived from your data matrix).
One direct result of this eigenvector decomposition is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data.
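A short sketch of that variance-explained check with scikit-learn's PCA, on synthetic data (a low-rank signal plus noise, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: a rank-3 signal embedded in 10 columns, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(1000, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_dims = int(np.searchsorted(cumulative, 0.95) + 1)   # dimensions for ~95% of the variance
print(cumulative.round(3), n_dims)
```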
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course lets you access R from inside RapidMiner. R has plenty of PCA libraries (packages). The ones I mention below are all available on CRAN, which means any of the PCA packages there satisfy the minimum package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, I can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is a tutorial for Independent Component Analysis (ICA) rather than PCA, but I mention it here because it's an excellent tutorial and the two techniques are used for similar purposes.

Resources