I'm trying to understand how the feature importance is calculated for regression trees (and their ensemble counterparts). I'm looking at the source code for the function compute_feature_importances in /sklearn/tree/_tree.pyx and cannot quite follow the logic - and there is no reference.
Sorry this may be a very basic question, but I couldn't find a good literature reference for this, and I was hoping someone could either point me in the right direction, or quickly explain the code so I can keep digging.
Thanks
The reference is in the docs rather than the code:
`feature_importances_` : array of shape = [n_features]
The feature importances. The higher, the more important the
feature. The importance of a feature is computed as the (normalized)
total reduction of the criterion brought by that feature. It is also
known as the Gini importance [4]_.
.. [4] L. Breiman, and A. Cutler, "Random Forests",
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Related
I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine if that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology because google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to determine the correlation coefficient and plot a linear regression. I thought this may be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I think this is called the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line/function says it should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into google should provide you with a large number of references so I'll stick to more concise descriptions.
(1) Structural break modelling. As the name suggest, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest. Namely, you would estimate the squared error (or some other measure of fit) on the full sample and the two sub-samples (before and after break). If the gains in fit are large when dividing the sample, then you would favour the model with the break and use different coefficients before and after the structural change.
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
I hope this helps!
I have a two 3D variables for a each time step (so I have N 3d matrix var(Nx,Ny,Nz), for each variables). I want to construct the two point statistics but I guess I'm doing something wrong.
Two-point statistics formula, where x_r is the reference point and x is the independent variable
I know that the theoretical formulation of a two-point cross correlation is the one written above.
Let's for sake of simplicity ignore the normalization, so I'm focusing on the numerator, that is the part I'm struggling with.
So, my two variables are two 3D matrix, with the following notation phi(x,y,z) = phi(i,j,k), same for psi.
My aim is to compute a 3d correlation, so given a certain reference point Reference_Point = (xr,yr,zr), but I guess I'm doing something wrong. I'm trying that on MATLAB, but my results are not accurate, and by doing some researches online it does seem that I should do convolutions or fft, but I don't find any theoretical framework that explains how to do that and why the formulation above in practices should be implemented by the use of a conv or fft. Moreover I would like to implement my cross-correlation in the spatial domain and not in the frequency domain, and with the convolution I don't understand how to choose the reference point.
Thank you so much in advance for reply
Despite going through lots of similar question related to this I still could not understand why some algorithm is susceptible to it while others are not.
Till now I found that SVM and K-means are susceptible to feature scaling while Linear Regression and Decision Tree are not.Can somebody please elaborate me why? in general or relating to this 4 algorithm.
As I am a beginner, please explain this in layman terms.
One reason I can think of off-hand is that SVM and K-means, at least with a basic configuration, uses an L2 distance metric. An L1 or L2 distance metric between two points will give different results if you double delta-x or delta-y, for example.
With Linear Regression, you fit a linear transform to best describe the data by effectively transforming the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, pretty much by definition, your result will be invariant to any linear transform including feature scaling.
With Decision Trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test - you pass this into your entropy function. Because this rule format does not depend on dimension scale, since there is no continuous distance metric, we again have in-variance.
Somewhat different reasons for each, but I hope that helps.
Does anyone can give an example of a SVM? Especially how to get the w and b from a training set?
I tried to search in the internet but it only gives me a large amount of abstract mathematics.
As I am not good at it, so could anyone give me an illustration of a SVM with an example in very details?
Thank you so much.
This diagram on wikipedia provides a good example of what the goal is, but in truth a support vector machine is a lot of complicated math. You find the values for w and b by optimizing a quadratic programming system, and when hidden behind vector mathematics, it's not entirely clear what's going on unless you're well-tuned to the math.
I just recently started learning about genetic algorithms and am now trying to implement them in 2D shape optimization in physics simulaiton. The simulation produces a single scalar for each shape. (I guess this is kind of similar to boxcar2d http://boxcar2d.com/)
The 2D shapes are actually the union of several 2D "sub shapes." Each subshape is stored as an list of angles/radii. The 2D shape is then stored as a list of subshape lists. This serves as my chromosone right now.
Right now for fitness, I will probably use the scalar the simulation produced. My question is, how should I go about the selection and reproduction process? Would tournament be more appropriate, or would I want to use truncation in combination with the proportional selection? Also, how do you find a good mutation rate/population size, etc
sorry for so many questions but thanks in advance. I just don't really know where to start.
On my point of view the best way is to use adaptive reproduction strategy during evolution: at the first steps (let name it - "the first phase of calculations") you might set high mutation probability, at this phase you should find enough good solution. At the "second phase" of algorithm you might set decreasing of mutation probability every few steps - at this phase you should improve your solution. But sometimes in my practice I've noticed degradation of population during second phase of optimization (when each chromosome is strongly similar to other) - which effects with extremly slowing down of optimization pperformance, so my solution was to improve algorithm with high valued mutation random perturbations and it helps.
Also I'll advice you to read about differential evolution algorithm - http://en.wikipedia.org/wiki/Differential_evolution. As for me it's performance is much more faster than genetic algorithm.