Can anyone give an example of an SVM? In particular, how do you get w and b from a training set?
I tried searching the internet, but all I found was a large amount of abstract mathematics.
Since I am not good at that, could anyone illustrate an SVM with a very detailed example?
Thank you so much.
This diagram on Wikipedia provides a good example of what the goal is, but in truth a support vector machine involves a lot of complicated math. You find the values of w and b by solving a quadratic programming problem, and when that is hidden behind vector mathematics it's not entirely clear what's going on unless you're comfortable with the math.
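If you mainly want to see concrete numbers rather than work through the optimization by hand, a library will solve the quadratic program for you. Here is a minimal sketch using scikit-learn with a tiny made-up training set; after fitting, `coef_` plays the role of w and `intercept_` the role of b.

```python
# Minimal sketch: fit a linear SVM on a toy training set and read off
# w (coef_) and b (intercept_). The data is made up purely for illustration.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2-D.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],   # class +1
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset, so the decision boundary is w.x + b = 0
print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)
```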
I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine if that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology because google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to determine the correlation coefficient and plot a linear regression. I thought this may be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I think this is called the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line/function says they should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all, or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into google should provide you with a large number of references so I'll stick to more concise descriptions.
(1) Structural break modelling. As the name suggests, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, a change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest: you would estimate the squared error (or some other measure of fit) on the full sample and on the two sub-samples (before and after the break). If the gains in fit are large when dividing the sample, then you would favour the model with the break and use different coefficients before and after the structural change (see the sketch below).
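To make (1) concrete, here is a rough sketch (made-up data and break point, and not a formal Chow test) of comparing the fit of a single regression with the fit of two regressions split at a candidate break:

```python
# Structural-break sketch: compare the residual sum of squares of one
# regression on the full sample with the sum over two sub-samples
# split at a candidate break point. Data and break point are illustrative.
import numpy as np

def sse(x, y):
    """Sum of squared residuals from a simple linear fit y ~ a + b*x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return np.sum(resid ** 2)

rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
# Relationship changes after observation 60 (slope 1.0 -> 2.0).
y = np.where(x < 60, 1.0 * x, 2.0 * x - 60) + rng.normal(0, 5, 100)

k = 60  # candidate break point
sse_full = sse(x, y)
sse_split = sse(x[:k], y[:k]) + sse(x[k:], y[k:])
print("SSE, single model: ", round(sse_full))
print("SSE, split at k=60:", round(sse_split))
# A large drop in SSE when splitting the sample suggests a structural break.
```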
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
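And for a flavour of (2), short of a full state-space model, you can simply re-estimate the coefficient on a rolling window and watch whether it drifts. A small illustrative sketch with simulated data:

```python
# Rolling-window sketch: re-estimate the slope of y ~ x on a moving
# window and see whether it changes over time. Window size and data
# are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
n, window = 200, 40
x = rng.normal(size=n)
beta = np.linspace(1.0, 3.0, n)          # true slope drifts from 1 to 3
y = beta * x + rng.normal(0, 0.5, n)

rolling_slope = [
    np.polyfit(x[t - window:t], y[t - window:t], 1)[0]
    for t in range(window, n)
]
print("first estimated slope:", round(rolling_slope[0], 2))
print("last estimated slope: ", round(rolling_slope[-1], 2))
# A clear trend in the rolling slope is evidence that the relationship
# between x and y is changing over time.
```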
I hope this helps!
I'm using a tutorial (https://www.tidytextmining.com/nasa.html?q=correlation%20ne#networks-of-keywords) to learn about tidy text mining. I am hoping someone might be able to help with two questions:
In this tutorial, the correlation used to make the graph is 0.15. Is this best practice? I can't find any literature to help choose a cut-off.
In the graph attached from the tutorial, how is the centrality of clusters determined? Are more important words closer to the centre?
Thanks very much
I am not aware of any literature on a correlation threshold to use for this kind of network analysis; this will (I believe) depend on your particular dataset and how language is used in your context. This is a heuristic decision. Given what a correlation coefficient measures, I would expect 0.15 to be on the low side of what you might use.
The graph is represented visually in a two-dimensional plot via the layout argument of ggraph. You can read more about that here, but the very high-level takeaways are that there are a lot of options, they have a big impact on what your graph looks like, and often it's not clear which is the best choice.
Despite going through lots of similar questions related to this, I still could not understand why some algorithms are susceptible to feature scaling while others are not.
So far I have found that SVM and K-means are susceptible to feature scaling, while Linear Regression and Decision Trees are not. Can somebody please explain why, either in general or in relation to these 4 algorithms?
As I am a beginner, please explain this in layman terms.
One reason I can think of off-hand is that SVM and K-means, at least in a basic configuration, use an L2 distance metric. An L1 or L2 distance between two points will give different results if you double delta-x or delta-y, for example.
With Linear Regression, you fit a linear transform to best describe the data, effectively transforming the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, pretty much by definition, the fitted predictions are invariant to any linear transform of the features, including feature scaling (the coefficients simply rescale to compensate).
With Decision Trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test; those counts are what you pass into your entropy (or other impurity) function. Because this rule format does not depend on the scale of a dimension, and there is no continuous distance metric involved, we again have invariance.
Somewhat different reasons for each, but I hope that helps.
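If it helps to see it rather than read it, here is a small sketch (assuming scikit-learn, with made-up data) in which feature 0 carries the signal and feature 1 is pure noise; inflating the noise feature's scale changes what K-means finds but leaves the decision tree's predictions alone:

```python
# Scaling demo: feature 0 separates the two groups, feature 1 is noise.
# Blowing the noise feature up by 1000 changes the K-means clustering
# but not the decision tree's predictions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X = np.column_stack([
    np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)]),  # signal
    rng.normal(0, 1, 100),                                         # noise
])
X_big = X.copy()
X_big[:, 1] *= 1000  # exaggerate the noise feature's scale

km_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
km_big = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_big)
# Cluster labels are arbitrary, so measure agreement up to relabelling.
agree = max(np.mean(km_raw == km_big), np.mean(km_raw != km_big))
print("K-means agreement before/after scaling:", agree)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
tree_big = DecisionTreeClassifier(random_state=0).fit(X_big, y).predict(X_big)
print("Tree predictions identical:", np.all(tree_raw == tree_big))
```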
I am working on data mining problems and I have to find the similarity between pairs of objects. I know what the various statistical distances are, but I cannot find any source that explains when to use which statistical distance.
My answer is not going to be a plain "use this one", because there is no such thing in statistics.
In the past, when dealing with similar problems, I found myself using statistical distances such as the Mahalanobis distance, which is a particular case of the Bhattacharyya distance. I used KL divergence when building trees (minimum spanning trees, etc.).
A main difference between the two is that the Bhattacharyya distance is symmetric while the KL divergence is not, so you have to take this into account when thinking about what kind of information you want to extract about your data points.
In brief, I would use the Bhattacharyya.
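As a tiny illustration of that difference, here is a sketch on two made-up discrete distributions: the Bhattacharyya distance comes out the same whichever way round you pass the arguments, while the KL divergence does not.

```python
# Contrast the Bhattacharyya distance with KL divergence on two
# made-up discrete distributions: the former is symmetric, the latter is not.
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

def bhattacharyya(a, b):
    """Negative log of the Bhattacharyya coefficient for discrete distributions."""
    return -np.log(np.sum(np.sqrt(a * b)))

print("Bhattacharyya(p, q):", bhattacharyya(p, q))
print("Bhattacharyya(q, p):", bhattacharyya(q, p))  # identical
print("KL(p || q):", entropy(p, q))
print("KL(q || p):", entropy(q, p))                 # different
```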
I'm trying to understand how the feature importance is calculated for regression trees (and their ensemble counterparts). I'm looking at the source code for the function compute_feature_importances in /sklearn/tree/_tree.pyx and cannot quite follow the logic - and there is no reference.
Sorry this may be a very basic question, but I couldn't find a good literature reference for this, and I was hoping someone could either point me in the right direction, or quickly explain the code so I can keep digging.
Thanks
The reference is in the docs rather than the code:
`feature_importances_` : array of shape = [n_features]
The feature importances. The higher, the more important the
feature. The importance of a feature is computed as the (normalized)
total reduction of the criterion brought by that feature. It is also
known as the Gini importance [4]_.
.. [4] L. Breiman, and A. Cutler, "Random Forests",
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
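In plain terms: every time a feature is used to split a node, the impurity decrease at that split, weighted by the number of samples reaching the node, is credited to that feature, and the per-feature totals are then normalized to sum to one. Here is a rough sketch of that logic on a regression tree, assuming scikit-learn's internal `tree_` arrays (children_left, children_right, feature, impurity, weighted_n_node_samples); it is an illustration of the idea rather than the exact library code.

```python
# Rough re-implementation of the feature importance logic for a single
# fitted tree, using sklearn's internal tree_ arrays.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
t = reg.tree_

importances = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:          # leaf node: no split, nothing to credit
        continue
    # Impurity decrease at this split, weighted by samples reaching each node.
    decrease = (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
    importances[t.feature[node]] += decrease

importances /= importances.sum()   # normalize so the importances sum to 1
print("manual :", np.round(importances, 3))
print("sklearn:", np.round(reg.feature_importances_, 3))
```

The two printed vectors should agree, since the extra division by the root node's sample weight in the library code cancels out once the importances are normalized.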