I would like to know what the splits in a decision tree look like.
More specifically, I am wondering if they can be "relative". For example can the criterion be something like:
if x > y ..
or does it have to be some absolute value, as in:
if x > 0.5 ..
I don't know if it's worth creating synthetic features that introduce relationships between features or if that's already included by default.
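One way to check this empirically is the minimal sketch below. It assumes scikit-learn's DecisionTreeClassifier, whose splits compare a single feature against a constant threshold (feature <= value), so a relative rule such as x > y is only approximated by many absolute splits unless the difference is supplied as its own column. The synthetic data and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): the label is 1 when x > y.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

# Tree on the raw features: it can only use thresholds on x or on y alone.
raw_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Tree with one synthetic feature x - y appended as a third column.
X_aug = np.column_stack([X, X[:, 0] - X[:, 1]])
aug_tree = DecisionTreeClassifier(random_state=0).fit(X_aug, y)

print("depth on raw features:", raw_tree.get_depth())
print("depth with x - y added:", aug_tree.get_depth())
print("root split feature index:", aug_tree.tree_.feature[0])
```

With the difference column available, a single threshold near zero separates the classes, while the tree on the raw features needs a deeper staircase of splits to approximate the diagonal boundary.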
spaCy (and CoreNLP, and other parsers) output dependency trees in which a node can have a varying number of children. In spaCy, for example, each node has .lefts and .rights relations (multiple left branches and multiple right branches).
Pattern matching algorithms are considerably simpler (and more efficient) when they work over predicate trees whose nodes have a fixed arity.
Is there any standard transformation from these multi-trees into binary trees?
For example, in this example, we have "publish" with two .lefts=[just, journal] and one .right=[piece]. Can such sentences (generally) be transformed into a strict binary tree notation (where each node has 0 or 1 left branch and 0 or 1 right branch) without much loss of information, or are multi-trees essential to correctly carrying information?
There are different types of trees in language analysis: immediate-constituent trees and dependency trees (though you wouldn't normally talk of trees in dependency grammar). The former are usually binary (though there is no real reason why they have to be), as each category gets split into two subcategories, e.g.:
S -> NP VP
NP -> det N1
N1 -> adj N1 | noun
Dependencies are not normally binary in nature, so there is no simple way to transform them into binary structures. The only fixed convention is that each word will be dependent on exactly one other word, but it might itself have multiple words depending on it.
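The "one head, any number of dependents" convention can be sketched as a head table. The four tokens below are only the fragment mentioned in the question (the rest of the sentence is not given), and the table itself is a hypothetical illustration rather than real parser output.

```python
from collections import defaultdict

# Hypothetical head assignments: token -> head token (None marks the root).
heads = {
    "just": "publish",
    "journal": "publish",
    "publish": None,
    "piece": "publish",
}

# Invert the table to list the dependents of every head.
dependents = defaultdict(list)
for token, head in heads.items():
    if head is not None:
        dependents[head].append(token)

# Each token appears once as a key (exactly one head),
# but nothing bounds how many tokens point at the same head.
arity = {token: len(dependents[token]) for token in heads}
print(arity)
```

Here "publish" ends up with three dependents, which is why the structure does not fit a node type with at most one left and one right child.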
So, the answer is basically "no".
I have started using scikit-learn decision trees and so far it is working out quite well, but one thing I need to do is retrieve the set of sample Y values for the leaf node, especially when running a prediction. That is, given an input feature vector X, I want to know the set of corresponding Y values at the leaf node instead of just the regression value, which is the mean (or median) of those values. Of course one would want the sample mean to have a small variance, but I do want to extract the actual set of Y values and do some statistics/create a PDF.
I have used code like that in "how to extract the decision rules from scikit-learn decision-tree?" to print the decision tree, but the 'value' in the output is a single float representing the mean. I have a large dataset, so I limit the leaf size to e.g. 100, and I want to access those 100 values...
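Another solution is to use an (undocumented?) attribute of the sklearn DecisionTreeRegressor object, which is .tree_.impurity.
With the default squared-error criterion it returns the variance of the target values at each node (leaves included), so take the square root to get the standard deviation per leaf.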
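To get the actual Y values rather than a summary statistic, one option is the estimator's apply method, which returns the index of the leaf each sample lands in. The sketch below is a minimal example under stated assumptions: the training data is synthetic, and leaf_targets is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data (an assumption for illustration).
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(1000, 3))
y_train = X_train[:, 0] + rng.normal(0, 1, size=1000)

reg = DecisionTreeRegressor(min_samples_leaf=100, random_state=0)
reg.fit(X_train, y_train)

# Leaf index of every training sample, computed once.
train_leaves = reg.apply(X_train)

def leaf_targets(x_query):
    """Hypothetical helper: training targets sharing the query's leaf."""
    leaf_id = reg.apply(np.asarray(x_query, dtype=float).reshape(1, -1))[0]
    return y_train[train_leaves == leaf_id]

x_new = [5.0, 2.0, 7.0]
values = leaf_targets(x_new)
print(len(values), values.mean(), reg.predict([x_new])[0])
```

The mean of the returned values should agree with reg.predict for the same input, and with the default criterion their variance should agree with the tree_.impurity entry of that leaf. From there the values can be fed into a histogram or a density estimate.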
The README of HLearn states that the Monoid typeclass is used for parallel batch training. I've seen trainMonoid mentioned in several files, but I'm having difficulty dissecting this huge codebase. Could someone explain in beginner-friendly terms how it works? I guess it's somehow related to the associativity property.
It's explained in this article, which is linked from the page you linked in the question. Since you want a beginner-friendly description, I'll give you a very high-level description of what I understood after reading the article. Consider this a rough overview of the idea; to understand everything exactly you have to study the articles.
The basic idea is to use algebraic properties to avoid re-doing the same work over and over. They do it by using the associativity of the monoidal operation and homomorphisms.
Given two sets A and B with binary operations + and * respectively, a homomorphism is a function f: A -> B such that f(x + y) = f(x) * f(y), i.e. it's a function that preserves the structure between the two sets.
In the case of that article the function f is basically the function that maps the sets of inputs to the trained models.
So the idea is that you can divide the input data into different portions x and y, and instead of having to compute the model of the whole thing as in T(x + y) you can do the training on just x and y and then combine the results: T(x) * T(y).
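As a toy illustration of T(x + y) = T(x) * T(y), the sketch below uses a "model" that only stores the count and the sum of its training data, which is enough to predict a mean. This is written in Python for readability and is not HLearn's code; MeanModel, train and combine are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MeanModel:
    """Toy trained model: the sufficient statistics of a mean."""
    count: int = 0
    total: float = 0

    def combine(self, other):
        # The monoid operation "*": associative, with MeanModel() as identity.
        return MeanModel(self.count + other.count, self.total + other.total)

    def predict(self):
        return self.total / self.count

def train(data):
    # T: maps a list of numbers to a trained model.
    return MeanModel(len(data), sum(data))

x = [1, 2, 3]
y = [4, 5]

whole = train(x + y)                 # T(x + y), "+" is list concatenation
parts = train(x).combine(train(y))   # T(x) * T(y)

print(whole == parts, whole.predict())
```

Not every learning algorithm decomposes this neatly; the point of the sketch is only that when the model is a combinable summary of the data, training on the pieces and merging gives the same model as training on everything.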
Now this, as it is, doesn't really help that much, but in training you often repeat work. For example, in cross validation you split the data k times into a set of inputs for the trainer and a set of data used to test the trainer. But this means that in these k iterations you are executing T over the same portions of inputs multiple times.
Here monoids come into play: you can first split the domain into subsets, and compute T over these subsets, and then to compute the result of cross validation you can just put together the results from the corresponding subsets.
To give an idea: if the data is {1,2,3,4} and k = 3, the naive approach does:
T({1,2}) plus test on {3, 4}
T({1,3}) plus test on {2, 4}
T({1,4}) plus test on {2, 3}
Here you can see that we trained over 1 three times. Using the homomorphism we can compute T({1}) once and then combine the result with the other partial results to obtain the final trained model.
The correctness of the final result is assured by the associativity of the operations and the homomorphism.
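The reuse can be sketched with ordinary k-fold cross validation, which differs slightly from the sampling in the example above: each fold is trained exactly once, and the model for each round is obtained by combining the models of the other folds. The toy model (count, sum) and the names train, combine and IDENTITY are assumptions for illustration.

```python
from functools import reduce

def train(data):
    # Toy model: (count, sum), enough to predict the mean.
    return (len(data), sum(data))

def combine(m1, m2):
    # Associative merge of two toy models.
    return (m1[0] + m2[0], m1[1] + m2[1])

IDENTITY = (0, 0)  # model trained on no data

folds = [[1, 2], [3, 4], [5, 6]]

# T is run once per fold, not once per cross-validation round.
fold_models = [train(fold) for fold in folds]

results = []
for i, held_out in enumerate(folds):
    others = [m for j, m in enumerate(fold_models) if j != i]
    model = reduce(combine, others, IDENTITY)

    # Reference: retraining from scratch on the same data.
    retrained = train([v for j, f in enumerate(folds) if j != i for v in f])

    mean = model[1] / model[0]
    mse = sum((v - mean) ** 2 for v in held_out) / len(held_out)
    results.append((model, retrained, mse))

for model, retrained, mse in results:
    print(model, model == retrained, mse)
```

Each round costs a few cheap merges instead of a full pass over the training data.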
The same idea can be applied when parallelizing: you divide the inputs into k groups, perform the training in parallel and then compound the results: T(x_1 + x_2 + ... + x_k) = T(x_1) * T(x_2) * ... * T(x_k) where the T(x_i) calls are performed completely in parallel and only at the end you have to compound the results.
Regarding online training algorithms, the idea is that given a "batch" training algorithm T you can make it into an online one by doing:
T_O(m, d) = m * T(d)
where m is an already trained model (which generally will be the trained model until that point) and d is the new data point you add for training.
Again, the exactness of the result is due to the homomorphism, which tells you that if m = T(x) then m * T(d) = T(x+d), i.e. the online algorithm gives the same result as the batch algorithm run on all those data points.
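The online construction T_O(m, d) = m * T(d) can be sketched with the same kind of toy model. The names train, combine and train_online are hypothetical, and the model (count, sum) is an assumption chosen so that the merge is exact.

```python
def train(data):
    # Batch trainer T: toy model (count, sum).
    return (len(data), sum(data))

def combine(m1, m2):
    # The monoid operation "*".
    return (m1[0] + m2[0], m1[1] + m2[1])

def train_online(model, point):
    # T_O(m, d) = m * T(d): fold one new data point into an existing model.
    return combine(model, train([point]))

stream = [2, 4, 6, 8]

model = (0, 0)  # identity element: the model of no data
for d in stream:
    model = train_online(model, d)

print(model, model == train(stream))
```

After the loop, the incrementally built model is the same as the one the batch trainer produces on the whole stream.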
The more interesting (and complex) part of all of this is how you can see a training task as such a homomorphism, etc. I'll leave that to your personal study.
Consider big datasets with 2 billion+ samples and approximately 100+ features per sample. Among these, 10% of the features are numerical/continuous variables and the rest are categorical variables (position, languages, url, etc.).
Let's use some examples:
e.g: dummy categorical feature
feature: Position
real values: SUD | CENTRE | NORTH
encoded values: 1 | 2 | 3
...it would make sense to use a reduction like SVD, because the distance between sud:north > sud:centre and, moreover, it's possible to encode this variable (e.g. with OneHotEncoder, StringIndexer) because of the small cardinality of its value set.
e.g: real categorical feature
feature: url
real values: very high cardinality
encoded values: ?????
1) In MLlib, 90% of the models work only with numerical values (apart from the Frequent Itemset and DecisionTree techniques).
2) Feature transformers/reducers/extractors such as PCA or SVD are not good for this kind of data, and there is no implementation of (e.g.) MCA.
a) What would your approach be to deal with this kind of data in Spark, or using MLlib?
b) Do you have any suggestions for coping with this many categorical values?
c) After reading a lot of the literature and counting the models implemented in Spark, my idea is that, for making inferences about one of these features using the others (categorical), the models in point 1 could be the best choice. What do you think about it?
(To standardize a classical use case, you can imagine the problem of inferring the gender of a person using visited URLs and other categorical features.)
Given that I am a newbie with regard to MLlib, may I ask you to provide a concrete example?
Thanks in advance
Well, first I would say Stack Overflow works in a different way: you should be the one providing a working example of the problem you are facing, and we help you out using that example.
Anyway, I got intrigued by the use of categorical values like the one you show as position. If this is a categorical value, as you mention, with 3 levels SUD, CENTRE, NORTH, there is no distance between them if they are truly categorical. In this sense I would create dummy variables like:
        SUD_Cat  CENTRE_Cat  NORTH_Cat
SUD     1        0           0
CENTRE  0        1           0
NORTH   0        0           1
This is a true dummy representation of a categorical variable.
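The dummy encoding can be sketched in plain Python as below. In Spark the question's StringIndexer and OneHotEncoder would play this role; the helper names fit_levels and one_hot here are hypothetical and only illustrate the idea on a handful of values.

```python
def fit_levels(values):
    # Distinct levels in order of first appearance.
    return list(dict.fromkeys(values))

def one_hot(value, levels):
    # One indicator column per level.
    return [1 if value == level else 0 for level in levels]

positions = ["SUD", "CENTRE", "NORTH", "SUD"]
levels = fit_levels(positions)
encoded = [one_hot(p, levels) for p in positions]

for p, row in zip(positions, encoded):
    print(p, row)

def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Every pair of distinct levels is equally far apart in this encoding.
vectors = {level: one_hot(level, levels) for level in levels}
print(squared_distance(vectors["SUD"], vectors["NORTH"]),
      squared_distance(vectors["SUD"], vectors["CENTRE"]))
```

Unlike the integer codes 1 | 2 | 3, this representation does not make SUD closer to CENTRE than to NORTH, which is what "truly categorical" means here.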
On the other hand if you want to take that distance into account then you have to create another feature which takes this distance into account explicitly, but that is not a dummy representation.
If the problem you are facing is that, after you wrote your categorical features as dummy variables (note that now all of them are numerical), you have very many features and you want to reduce your feature space, then that is a different problem.
As a rule of thumb I try to use the entire feature space first, which is feasible now since Spark's computing power allows you to run modelling tasks on big datasets; if it is too big, then I would go for dimensionality reduction techniques, PCA etc.
I have a simple question that I don't understand:
Why is the decision tree in scikit-learn a binary tree instead of an n-ary tree?
Does anyone know the answer? Please tell me, thank you so much.
This is better suited for the Cross Validated site, but the answer is simplicity. Any decision tree simply partitions the space, with leaf nodes holding data and the assignment being a function of that data, typically the majority class in the case of classification and the empirical average in the case of regression.
However, every decision tree can be converted into a binary decision tree. Intuitively, if you have a rule at level 1 like (X1 < 1 AND X2 > 10), then this can be converted into a two-level rule by shifting one part of the predicate downward.
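The conversion can be sketched as two functions that should agree on every input: one node testing both conditions at once, and a two-level chain of single-feature tests. The leaf labels "A" and "B" are placeholders chosen for illustration.

```python
def compound_rule(x1, x2):
    # One node that tests two features at once.
    return "A" if (x1 < 1 and x2 > 10) else "B"

def binary_rule(x1, x2):
    # Same partition, expressed as nested single-feature binary splits.
    if x1 < 1:          # level 1: first half of the predicate
        if x2 > 10:     # level 2: second half, shifted downward
            return "A"
        return "B"
    return "B"

# Compare the two rules on a small grid around the thresholds.
grid = [(x1 / 2, x2 / 2) for x1 in range(-4, 8) for x2 in range(14, 28)]
agree = all(compound_rule(a, b) == binary_rule(a, b) for a, b in grid)
print(agree)
```

The price of the conversion is extra depth and a duplicated "B" leaf, not any loss of expressiveness.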
It is much simpler to train a binary decision tree than an n-ary one because of the combinatorial explosion that takes place. Instead of randomly picking a splitting variable and then optimizing over that field (1-d optimization), n-ary trees must select a subset of variables and optimize over that set.