Why does Naive Bayes fail to solve XOR

I have recently started learning about algorithms related to natural language processing, and have come across various sites which state that Naive Bayes cannot capture the XOR concept. Firstly, I do not understand what exactly the XOR problem is. Can someone please explain what the XOR problem is, with a simple classification example if possible?

The XOR problem is the simplest problem that is not linearly separable.
Imagine you have two Boolean variables X and Y, and the target value you want to "predict" is the result of XORing the two variables. That is, you want to predict 1 only when exactly one of the two variables is 1, and 0 otherwise. A bit more graphically:
    Y ^
    1 |  XOR(x=0,y=1)=1   XOR(x=1,y=1)=0
      |
    0 |  XOR(x=0,y=0)=0   XOR(x=1,y=0)=1
      +------------------------------->
              0                1        X
As you can see, for the four "points" of my "plot" above (X horizontally, Y vertically; imagine the commas are the "points", if you like), there is no way you can draw a straight line that separates the two outcomes (the two 1s in the upper left and lower right, and the two 0s, also in opposing corners). So linear classifiers, which model the class separation using straight lines, cannot solve problems of this nature.
Now, as to Naive Bayes, it models independent events. Given only X and Y, it can model the distribution of xs and it can model the ys, but it does not model any relation between the two variables. That is, to model the XOR function, the classifier would have to observe both variables at the same time. Only making a prediction based on the state of X without taking into account Y's state (and vice versa) cannot lead to a proper solution for this problem. Hence, the Naive Bayes classifier is a linear classifier, too.
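As a minimal sketch of the point above (plain Python rather than any particular library's Naive Bayes; the helper names fit_naive_bayes and posterior are made up for illustration), one can fit the per-feature conditional probabilities on the four XOR points by counting. Each feature on its own looks identical under both classes, so the classifier ends up with a posterior of 0.5 for every point:

```python
from collections import Counter
from itertools import product

# The four XOR training points as ((x, y), label).
data = [((x, y), x ^ y) for x, y in product((0, 1), repeat=2)]

def fit_naive_bayes(samples):
    """Estimate P(class) and P(feature_i = v | class) by counting."""
    class_counts = Counter(label for _, label in samples)
    priors = {c: n / len(samples) for c, n in class_counts.items()}
    cond = {}
    for features, label in samples:
        for i, v in enumerate(features):
            key = (i, v, label)
            cond[key] = cond.get(key, 0.0) + 1 / class_counts[label]
    return priors, cond

def posterior(features, priors, cond):
    """P(class | features) under the naive independence assumption."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(features):
            score *= cond.get((i, v, c), 0.0)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

priors, cond = fit_naive_bayes(data)
posteriors = {features: posterior(features, priors, cond) for features, _ in data}
for features, p in posteriors.items():
    print(features, p)  # every point gets {0: 0.5, 1: 0.5}
```

Because P(X=v | class) and P(Y=v | class) are all 0.5, the product of per-feature terms carries no information about the class; only a model that looks at X and Y jointly can tell the corners apart.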

Related

Regression Analysis with insignificant regressors

I am trying to predict a variable using a number of explanatory variables each of which has no visually detectable relationship, that is the scatterplots between each regressor and the predicted variable are completely flat clouds.
I took 2 approaches:
1) Running individual regressions yields no significant relationship at all.
2) Once I play around with multiple combinations of multivariable regressions, I get significant relationships for some combinations (which are not robust, though; that is, a variable is significant in one setting and loses significance in a different setting).
I am wondering whether, based on 1), i.e. the fact that on an individual basis there seems to be no relationship at all, I can conclude that a multivariable approach is destined to fail as well?

The answer is most definitely no, it is not guaranteed to fail. In fact, you've already observed this in #2, where you get significant predictors in a multiple-predictor setting.
A regression between 1 predictor and 1 outcome amounts to the covariance or the correlation between the two variables. This is the relationship you observe in your scatterplots.
A regression where you have multiple predictors (multiple regression) has a rather different interpretation. Let's say you have a model like: Y = b0 + b1X1 + b2X2
b1 is interpreted as the relationship between X1 and Y holding X2 constant. That is, you are controlling for the effect of X2 in this model. This is a very important feature of multiple regression.
To see this, run the following models:
Y = b0 + b1X1
Y = b0 + b1X1 + b2X2
You will see that the value of b1 differs between the two models. The size of the difference depends on the magnitude of the covariance/correlation between X1 and X2.
Just because a straight correlation between 2 variables is not significant does not mean that the relationship will remain non-significant once you control for the effect of other predictors.
This point is highlighted by your example of robustness in #2. Why would a predictor be significant in some models, and non-significant when you use another subset of predictors? It is precisely because you are controlling for the effect of different variables in your various models.
Which other variables you choose to control for, and ultimately which specific regression model you choose to use, depends on what your goals are.
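To illustrate how b1 can change once X2 is controlled for, here is a small sketch in plain Python. The data are made up for this example (X2 = X1 + E and Y = -E, so Y = X1 - X2 exactly), and the helper names are hypothetical. On its own, X1 has a slope of exactly 0 against Y, yet in the two-predictor model its coefficient is 1:

```python
def centered(values):
    """Subtract the mean so the intercept drops out of the slope formulas."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def simple_slope(x, y):
    """b1 in Y = b0 + b1*X1."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)

def two_predictor_slopes(x1, x2, y):
    """(b1, b2) in Y = b0 + b1*X1 + b2*X2, via the 2x2 normal equations."""
    a, b, c = centered(x1), centered(x2), centered(y)
    s11, s12, s22 = dot(a, a), dot(a, b), dot(b, b)
    s1y, s2y = dot(a, c), dot(b, c)
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Made-up toy data: X2 = X1 + E and Y = -E, hence Y = X1 - X2.
x1 = [-1.0, -1.0, 1.0, 1.0]
e = [-1.0, 1.0, -1.0, 1.0]
x2 = [u + v for u, v in zip(x1, e)]
y = [-v for v in e]

b1_alone = simple_slope(x1, y)                        # 0.0: a flat scatterplot
b1_controlled, b2 = two_predictor_slopes(x1, x2, y)   # 1.0 and -1.0
print(b1_alone, b1_controlled, b2)
```

The example is deliberately extreme (no noise, four points), so it says nothing about significance tests; it only shows that a zero marginal slope does not imply a zero partial slope.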

What does N((1,0)T, I) mean in relation to the Gaussian distribution

Hi everyone, I am reading the book "The Elements of Statistical Learning" and came across the paragraph below, which I don't understand (it explains how the training data was generated).
We generated 10 means m_k from a bivariate Gaussian distribution N((1,0)T,I) and labeled this class blue. Similarly, 10 more were drawn from N((0,1)T,I) and labeled class orange. Then for each class we generated 100 observations as follows: for each observation, we picked an m_k at random with probability 1/10, and then generated a N(m_k, I/5), thus leading to a mixture of Gaussian clusters for each class.
I would appreciate it if you could explain the above paragraph, and especially N((0,1)T,I).
By the way, (0,1)T means (0,1) to the power of T, for transpose.
Is this notation mathematically common, or is it related to a specific computer language?

In the paragraph N stands for the Normal distribution; more specifically, in this case it stands for the Multivariate normal distribution. It is not specific to any programming languages. It comes from statistics and probability theory, but due to numerous appealing properties and important applications of this probability distribution it is also widely used in programming, so you should be able to perform the described procedure in any language.
The part (0,1)^T is a vector of means. That is, we have in mind a random vector of length two, where the first element on average is 0, and the second one on average is 1.
"I" stands for the 2x2 identity matrix whose role is the variance-covariance matrix. That is, the variance of both random vector components is 1 (i.e., the diagonal terms), while off-diagonal points are 0 and correspond to the covariance between the two random variables.

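The procedure from the quoted paragraph can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are made up, the blue class is centered at (1, 0) as in the question title and the orange class at (0, 1), and because every covariance matrix involved is a multiple of the identity, each coordinate is drawn as an independent one-dimensional normal:

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def sample_isotropic_gaussian(mean, variance):
    """Draw one point from N(mean, variance * I).

    With a diagonal covariance the components are independent, so each
    coordinate is an ordinary 1-D normal draw with std = sqrt(variance).
    """
    std = variance ** 0.5
    return tuple(rng.gauss(m, std) for m in mean)

def generate_class(center, n_means=10, n_obs=100):
    # Step 1: draw the 10 cluster means m_k from N(center, I).
    means = [sample_isotropic_gaussian(center, 1.0) for _ in range(n_means)]
    # Step 2: for each observation, pick an m_k uniformly (probability 1/10)
    # and draw the observation from N(m_k, I/5).
    return [sample_isotropic_gaussian(rng.choice(means), 1.0 / 5.0)
            for _ in range(n_obs)]

blue = generate_class((1.0, 0.0))
orange = generate_class((0.0, 1.0))
print(len(blue), len(orange), blue[0])
```

For a covariance matrix with non-zero off-diagonal entries the coordinates could no longer be drawn independently; a multivariate normal sampler from a numerical library would be the usual choice there.
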
Under what conditions can two classes have different average values, yet be indistinguishable to a SVM?

I am asking because I have observed sometimes in neuroimaging that a brain region might have different average activation between two experimental conditions, but sometimes an SVM classifier somehow can't distinguish the patterns of activation between the two conditions.
My intuition is that this might happen in cases where the within-class variance is far greater than the between-class variance. For example, suppose we have two classes, A and B, and that for simplicity our data consists just of integers (rather than vectors). Let the data falling under class A be 0,0,0,0,0,10,10,10,10,10. Let the data falling under class B be 1,1,1,1,1,11,11,11,11,11. Here, A and B are clearly different on average, yet there's no decision boundary that would allow A and B to be distinguished. I believe this logic would hold even if our data consisted of vectors, rather than integers.
Is this a special case of some broader range of cases where an SVM would fail to distinguish two classes that are different on average? Is it possible to delineate the precise conditions under which an SVM classifier would fail to distinguish two classes that differ on average?
EDIT: Assume a linear SVM.

As described in the comments - there are no such conditions because SVM will separate data just fine (I am not talking about any generalisation here, just separating training data). For the rest of the answer I am assuming there are no two identical points with different labels.
Non-linear case
For a kernel case, using something like RBF kernel, SVM will always perfectly separate any training set, given that C is big enough.
Linear case
If data is linearly separable then again - with big enough C it will separate data just fine. If data is not linearly separable, cranking up C as much as possible will lead to smaller and smaller training error (of course it will not get 0 since data is not linearly separable).
In particular, for the data you provided, a kernelized SVM will get 100% and a linear SVM will get only about 50%, but this has nothing to do with the means being different or with how the variances relate. It is simply a dataset that is not linearly separable: no single decision point classifies every point correctly, and the best possible cut (for example at 0.5, which gets everything right except the five 10s) reaches 75%. The linear SVM will separate the classes "in the middle", meaning that the decision point will be somewhere around 5, which gets half of the points right.

Cluster Scenario: Difference between the computedCost of 2 points used as similarity measure between points. Is it applicable?

I want to have a measure of similarity between two points in a cluster.
Would the similarity calculated this way be an acceptable measure of similarity between the two datapoint?
Say I have two vectors, vector A and vector B, that are in the same cluster. I have trained a clustering model, denoted by model, and model.computeCost() computes the squared distance between the input point and the corresponding cluster center.
(I am using Apache Spark MLlib)
val costA = model.computeCost(A)
val costB = model.computeCost(B)
val dissimilarity = math.abs(costA - costB)
This is a dissimilarity, i.e. the higher the value, the more unlike each other the two points are.

If you are just asking whether this is a valid metric, then the answer is: almost. It is a valid pseudometric, provided that .computeCost is deterministic.
For simplicity I denote f(A) := model.computeCost(A) and d(A, B) := |f(A)-f(B)|
Short proof: d is the L1 distance applied to the image of some function f, thus it is a pseudometric itself, and a metric if f is injective (in general, yours is not).
Long(er) proof:
d(A,B) >= 0 yes, since |f(A) - f(B)| >= 0
d(A,B) = d(B,A) yes, since |f(A) - f(B)| = |f(B) - f(A)|
d(A,B) = 0 iff A=B, no, and this is why it is only a pseudometric, since you can have many A != B such that f(A) = f(B)
d(A,C) <= d(A,B) + d(B,C), yes, directly from the triangle inequality for absolute values.
If you are asking whether it will work for your problem, then the answer is: it might, depending on the problem. There is no way to answer this without analysing your problem and data. As shown above, this is a valid pseudometric, so it measures something that is well behaved from a mathematical perspective. Whether it works for your particular case is a completely different story. The good news is that most algorithms which work with metrics work with pseudometrics as well. The only difference is that you "glue together" points which have the same image (f(A)=f(B)); if this is not an issue for your problem, you can use this kind of pseudometric in any metric-based reasoning. In practice, if your f is the function documented as
computes the sum of squared distances between the input point and the corresponding cluster center
then it is actually the (squared) distance to the closest center (there is no summation involved when you consider a single point). This means that two points in two separate clusters are considered identical when they are equally far from their own cluster centers. Consequently, your measure captures "how different the relations between the points and their respective clusters are". This is a well-defined, indirect dissimilarity computation, but you have to be fully aware of what is happening before applying it (since it has specific consequences).
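The "gluing" effect described above can be seen in a small plain-Python sketch. This is not the Spark MLlib API: the cost function below is a stand-in that returns the squared distance from a point to its closest center, and the centers and points are hypothetical values chosen for illustration:

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cost(point, centers):
    """Squared distance from a point to its closest cluster center."""
    return min(squared_distance(point, c) for c in centers)

def dissimilarity(a, b, centers):
    """d(A, B) = |f(A) - f(B)| with f = cost."""
    return abs(cost(a, centers) - cost(b, centers))

# Hypothetical cluster centers standing in for a trained model.
centers = [(0.0, 0.0), (10.0, 10.0)]

a = (1.0, 0.0)     # 1 unit from the first center
b = (0.0, 1.0)     # same cluster, different direction, also 1 unit away
c = (10.0, 11.0)   # different cluster, 1 unit from the second center
far = (3.0, 0.0)   # first cluster, squared distance 9

d_ab = dissimilarity(a, b, centers)       # 0.0 although a != b
d_ac = dissimilarity(a, c, centers)       # 0.0 although the clusters differ
d_a_far = dissimilarity(a, far, centers)  # 8.0
print(d_ab, d_ac, d_a_far)
```

The two zeros are exactly the cases where the pseudometric fails to be a metric: distinct points, even points in different clusters, end up at dissimilarity 0 whenever they are equally far from their own centers.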

Your "cost" is actually the distance to the center.
Points that have the same distance to the center are considered to be identical (distance 0), which creates a really odd pseudometric, because it ignores where on the circle of that radius the points lie.
It's not very likely this will work on your problem.

standard error of addition, subtraction, multiplication and ratio

Let's say I have two random variables, x and y, each with n observations. I've used a forecasting method to estimate x_{n+1} and y_{n+1}, and I also have the standard error for both x_{n+1} and y_{n+1}. My question is: what are the formulas for the standard error of x_{n+1} + y_{n+1}, x_{n+1} - y_{n+1}, x_{n+1} * y_{n+1} and x_{n+1} / y_{n+1}, so that I can calculate the prediction interval for the four combinations? Any thoughts would be much appreciated. Thanks.

Well, the general topic you need to look at is called "change of variables" in mathematical statistics.
The density function for a sum of random variables is the convolution of the individual densities (but only if the variables are independent). Likewise for the difference. In special cases, that convolution is easy to find. For example, for Gaussian variables the density of the sum is also a Gaussian.
For product and quotient, there aren't any simple results, except in special cases. For those, you might as well compute the result directly, maybe by sampling or other numerical methods.
If your variables x and y are not independent, that complicates the situation. But even then, I think sampling is straightforward.
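The sampling approach mentioned above could look roughly like the following sketch. It rests on assumptions that are not in the question: the forecast errors are taken to be Gaussian and independent, and the forecasts and standard errors are made-up numbers. For independent variables the variances of a sum or difference add, so the simulated standard error for those two cases can be checked against sqrt(se_x^2 + se_y^2):

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical forecasts and standard errors, for illustration only.
x_hat, se_x = 10.0, 0.5
y_hat, se_y = 5.0, 0.3

n_draws = 50_000
xs = [rng.gauss(x_hat, se_x) for _ in range(n_draws)]
ys = [rng.gauss(y_hat, se_y) for _ in range(n_draws)]  # drawn independently of xs

def summarize(values):
    """Return (standard error, empirical 95% interval) of simulated values."""
    ordered = sorted(values)
    lo = ordered[int(0.025 * len(ordered))]
    hi = ordered[int(0.975 * len(ordered))]
    return statistics.stdev(ordered), (lo, hi)

results = {
    "sum": summarize([x + y for x, y in zip(xs, ys)]),
    "difference": summarize([x - y for x, y in zip(xs, ys)]),
    "product": summarize([x * y for x, y in zip(xs, ys)]),
    "ratio": summarize([x / y for x, y in zip(xs, ys)]),
}

# For independent variables, variances add for both sum and difference.
analytic_se = math.hypot(se_x, se_y)

for name, (se, interval) in results.items():
    print(name, round(se, 3), interval)
print("analytic SE of sum/difference:", round(analytic_se, 3))
```

If x and y are correlated, the two lists would have to be drawn jointly (for example from a bivariate normal with the estimated covariance) instead of independently; the rest of the sketch stays the same. The ratio is only well behaved here because y_hat is many standard errors away from zero.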
