Regression Analysis with insignificant regressors

I am trying to predict a variable using a number of explanatory variables, none of which shows a visually detectable relationship with it; that is, the scatterplots between each regressor and the predicted variable are completely flat clouds.
I took 2 approaches:
1) Running individual regressions yields no significant relationship at all.
2) When I play around with multiple combinations of multivariable regressions, I get significant relationships for some combinations (which are not robust, though: a variable is significant in one setting and loses significance in a different one).
I am wondering whether, based on 1), i.e. the fact that there seems to be no relationship at all on an individual basis, I can conclude that a multivariable approach is destined to fail as well?

The answer is most definitely no, it is not guaranteed to fail. In fact, you've already observed this in #2, where you get significant predictors in a multiple-predictor setting.
A regression with one predictor and one outcome essentially amounts to the covariance, or correlation, between the two variables. This is the relationship you observe in your scatterplots.
A regression with multiple predictors (multiple regression) has a rather different interpretation. Let's say you have a model like: Y = b0 + b1X1 + b2X2
Here b1 is interpreted as the relationship between X1 and Y holding X2 constant. That is, you are controlling for the effect of X2 in this model. This is a very important feature of multiple regression.
To see this, run the following models:
Y = b0 + b1X1
Y = b0 + b1X1 + b2X2
You will see that the values of b1 in the two models differ. The degree of difference between the b1 values will depend on the magnitude of the covariance/correlation between X1 and X2.
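To see the difference concretely, here is a minimal R simulation with made-up data (the coefficients and sample size are purely illustrative); it also shows how a predictor can look flat on its own yet matter once the other predictor is controlled for:
set.seed(1)
n  <- 200
x2 <- rnorm(n)
x1 <- x2 + rnorm(n)                    # X1 and X2 are correlated
y  <- x1 - 2 * x2 + rnorm(n)           # Y depends on both predictors
summary(lm(y ~ x1))$coefficients       # b1 should be near 0 and typically not significant
summary(lm(y ~ x1 + x2))$coefficients  # b1 is close to 1 and significant once X2 is held constant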
Just because a straight correlation between two variables is not significant does not mean that the relationship will remain non-significant once you control for the effect of other predictors.
This point is highlighted by your example of robustness in #2. Why would a predictor be significant in some models, and non-significant when you use another subset of predictors? It is precisely because you are controlling for the effect of different variables in your various models.
Which other variables you choose to control for, and ultimately which specific regression model you choose to use, depends on what your goals are.


Why does an lmer model converge in one experimental condition but not another?

I am new to using linear mixed-effects models. I have a dataset where participants (ID, N = 7973) completed two experimental conditions (A and B). A subset of participants are siblings and thus nested in families (famID, N = 6908).
omnibus_model <- lmer(Outcome ~ Var1*Var2*Cond + (Cond|ID) + (1|famID), data=df)
The omnibus model converges and indicates a significant three way interaction between Var1, Var2 and Cond. As a post-hoc, to better understand what is driving the omnibus model effect, I subsetted the data so that there is only one observation per ID.
condA <- df[which(df$condition=='A'),]
condA_model <- lmer(Outcome ~ Var1*Var2 + (1|famID), data=condA)
condB <- df[which(df$condition=='B'),]
condB_model <- lmer(Outcome ~ Var1*Var2 + (1|famID), data=condB)
condA_model converges; condB_model does not. In condB_model, the "famID (Intercept)" variance is estimated at 0. In condA_model, I get a small but non-zero estimate (variance = 0.001479). I know I could get an estimate of the fixed effect of interest in condition A versus B by a different method (such as randomly selecting one sibling per family for the analysis and not using random effects), but I am concerned that this differential convergence pattern may indicate differences between the conditions that would influence the interpretation of the omnibus model effect.
What difference in the two conditions could be causing the model in one subset not to converge? How would I test for the possible differences in my data? Shouldn't the random effect of famID be identical in both subsets and thus equally able to be estimated in both post-hoc models?
As a post-hoc, to better understand what is driving the omnibus model effect, I subsetted the data so that there is only one observation per ID.
This procedure does not make sense as a post-hoc analysis: you end up fitting different models to different subsets of the data, rather than probing the interaction within the omnibus model.
What difference in the two conditions could be causing the model in one subset not to converge?
There are many reasons. For one thing, these reduced datasets are, well, reduced, i.e. smaller, so there is far less statistical power to detect the "effects" that you are interested in, such as the variance of a random effect. In such cases, the variance may be estimated as zero, which results in a singular fit.
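If you want to check that this is what happened rather than some other optimisation failure, a quick inspection along these lines should work, assuming the models were fitted with lme4:
VarCorr(condB_model)     # shows the famID intercept variance estimated at (or extremely close to) 0
isSingular(condB_model)  # TRUE means the random-effects estimate sits on the boundary (a singular fit)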
Shouldn't the random effect of famID be identical in both subsets and thus equally able to be estimated in both post-hoc models?
No, these are completely different models, since the underlying data are different. There is no reason to expect the same estimates from both models.

How can r-squared be negative when the correlation between prediction and truth is positive?

I am trying to understand how the r-squared (and also the explained variance) metric can be negative (thus indicating non-existent forecasting power) when, at the same time, the correlation coefficient between prediction and truth (as well as the slope in a linear regression of truth on prediction) is positive.
R squared can be negative in a rare scenario.
R squared = 1 - (SSR/SST)
Here, SST stands for the total sum of squares, which measures how much the original data points vary around the mean of the target variable. The mean acts as the baseline "regression line" here.
SST = Sum (Square (Each data point - Mean of the target variable))
For example,
If we want to build a regression model to predict the height of a student with weight as the independent variable, then a possible prediction without much effort is to calculate the mean height of all current students and use it as the prediction.
Picture a plot of the data with a horizontal red line at this mean: that red line is the "regression line" of this naive model. The mean is calculated without much effort and can be considered one of the worst methods of prediction, with poor accuracy; the prediction is nowhere near most of the original data points.
Now, on to SSR.
SSR stands for the sum of squared residuals. These residuals are calculated from the model we actually fit (linear regression, Bayesian regression, polynomial regression, or any other approach). If we use such a fitted model rather than a naive approach like the mean, our accuracy will obviously increase.
SSR = Sum (Square (Each data point - the corresponding predicted value on the regression line))
Picture a blue line on the same plot for this fitted model, built with more mathematical analysis: it obviously tracks the data more closely than the red (mean) line.
Now come back to the formula:
R squared = 1 - (SSR/SST)
Here, SST will be a large number because it comes from a very poor model (the red line), while SSR will be a small number because it comes from the best model we developed after much mathematical analysis (the blue line). So SSR/SST will be a very small number (it gets smaller whenever SSR decreases), and 1 - (SSR/SST) will therefore be a large number, close to 1. So we can infer that the higher R squared is, the better the model fits.
This is the generic case, but it does not tell the whole story when multiple independent variables are present. In the example we had only one independent variable and one target variable, but in real problems we may have hundreds of independent variables for a single dependent variable. The actual problem is that, out of those hundreds of independent variables:
Some variables will have a very high correlation with the target variable.
Some variables will have a very small correlation with the target variable.
And some independent variables will have no correlation at all.
So, R squared is calculated under the assumption that the mean line of the target (a horizontal line, i.e. one perpendicular to the y-axis) is about the worst fit a model can have. SST is the sum of squared differences between this mean line and the original data points. Similarly, SSR is the sum of squared differences between the predicted data points (from the model plane) and the original data points.
SSR/SST is therefore a ratio of how the model's error compares with the baseline's. If your model can build a plane that is even somewhat better than that worst case, then in most cases SSR < SST, which makes R squared positive when you substitute it into the equation.
But what if SSR > SST? This means your regression plane is worse than the mean line. In that case R squared will obviously be negative, but this happens only in a small fraction of cases.
This answer was originally written by me on Quora:
https://qr.ae/pNsLU8
https://qr.ae/pNsLUr
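Coming back to the question about correlation: a small numeric sketch (with made-up numbers) shows that predictions can be perfectly positively correlated with the truth and still give a negative R squared, because R squared also punishes bias and scale errors:
truth <- c(1, 2, 3, 4, 5)
pred  <- 2 * truth + 2                 # perfectly correlated with the truth, but badly biased
sst   <- sum((truth - mean(truth))^2)  # error of the "predict the mean" baseline
ssr   <- sum((truth - pred)^2)         # error of the actual predictions
cor(truth, pred)                       # 1: the correlation is positive
1 - ssr / sst                          # -12.5: the predictions are worse than the mean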

Under what conditions can two classes have different average values, yet be indistinguishable to an SVM?

I am asking because I have sometimes observed in neuroimaging that a brain region has different average activation between two experimental conditions, yet an SVM classifier somehow cannot distinguish the patterns of activation between the two conditions.
My intuition is that this might happen in cases where the within-class variance is far greater than the between-class variance. For example, suppose we have two classes, A and B, and that for simplicity our data consists just of integers (rather than vectors). Let the data falling under class A be 0,0,0,0,0,10,10,10,10,10. Let the data falling under class B be 1,1,1,1,1,11,11,11,11,11. Here, A and B are clearly different on average, yet there's no decision boundary that would allow A and B to be distinguished. I believe this logic would hold even if our data consisted of vectors, rather than integers.
Is this a special case of some broader range of cases where an SVM would fail to distinguish two classes that are different on average? Is it possible to delineate the precise conditions under which an SVM classifier would fail to distinguish two classes that differ on average?
EDIT: Assume a linear SVM.
As described in the comments: there are no such conditions, because an SVM will separate the data just fine (I am not talking about generalisation here, just about separating the training data). For the rest of the answer I assume there are no two identical points with different labels.
Non-linear case
For the kernel case, using something like the RBF kernel, an SVM will always perfectly separate any training set, given that C is big enough.
Linear case
If the data are linearly separable then, again, with a big enough C the SVM will separate them just fine. If the data are not linearly separable, cranking up C as much as possible will lead to smaller and smaller training error (though it will not reach 0, since the data are not linearly separable).
In particular, for the data you provided a kernelized SVM will reach 100% training accuracy, while a linear SVM will end up at around 50%. But this has nothing to do with the means being different or with the relation between the variances: it is simply a dataset that no single threshold can split cleanly, so the limitation is not specific to SVMs. The linear SVM will place its boundary roughly "in the middle", meaning that the decision point will be somewhere around 5.
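As a quick sketch of this in R (assuming the e1071 package is available; the cost value is arbitrary):
library(e1071)
x <- matrix(c(0, 0, 0, 0, 0, 10, 10, 10, 10, 10,
              1, 1, 1, 1, 1, 11, 11, 11, 11, 11), ncol = 1)
y <- factor(rep(c("A", "B"), each = 10))
lin <- svm(x, y, kernel = "linear", cost = 1e3, scale = FALSE)
rbf <- svm(x, y, kernel = "radial", cost = 1e3, scale = FALSE)
mean(predict(lin, x) == y)  # about 0.5: the linear SVM puts its boundary in the middle, around 5
mean(predict(rbf, x) == y)  # 1: with a large enough cost, the kernelized SVM separates the training set perfectly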

Why does Naive Bayes fail to solve XOR

I have recently started learning about algorithms related to natural language processing, and have come across various sites which indicate that Naive Bayes cannot capture the XOR concept. Firstly, I do not understand what exactly the XOR problem is. Can someone please explain what the XOR problem is, with a simple classification example if possible?
The XOR problem is the most simple problem that is not linearly separable.
Imagine you have two Boolean variables X and Y, and the target value you want to "predict" is the result of XORing the two variables. That is, you want to predict 1 as the outcome only when exactly one of the two is 1, and 0 otherwise. A bit more graphically:
Y ^
1 |  XOR(x=0,y=1)=1      XOR(x=1,y=1)=0
  |
0 |  XOR(x=0,y=0)=0      XOR(x=1,y=0)=1
  +------------------------------->
        0                   1       X
As you can see, for the four "points" of my "plot" above (X horizontally, Y vertically; imagine the commas are the "points", if you like), there is no way you can draw a straight line that separates the two outcomes (the two 1s in the upper left and lower right, and the two 0s, also in opposing corners). So linear classifiers, which model the class separation using straight lines, cannot solve problems of this nature.
Now, as to Naive Bayes: it models the features as independent given the class. With only X and Y as features, it can model the distribution of the xs and it can model the distribution of the ys, but it does not model any relation between the two variables. To model the XOR function, however, the classifier would have to look at both variables at the same time; making a prediction based on the state of X without taking Y's state into account (and vice versa) cannot lead to a proper solution for this problem. Hence, the Naive Bayes classifier is a linear classifier, too.
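To make this concrete, here is a tiny sketch fitting a Naive Bayes classifier to the XOR truth table (assuming the e1071 package; any Naive Bayes implementation behaves the same way):
library(e1071)
d <- data.frame(X = factor(c(0, 0, 1, 1)),
                Y = factor(c(0, 1, 0, 1)),
                label = factor(c(0, 1, 1, 0)))  # label = XOR(X, Y)
nb <- naiveBayes(label ~ X + Y, data = d)
predict(nb, d)  # every class-conditional probability is 0.5, so the posteriors tie and
                # the predictions are no better than chance even on the training data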

Cluster scenario: difference between the computeCost of 2 points used as a similarity measure between points. Is it applicable?

I want to have a measure of similarity between two points in a cluster.
Would the similarity calculated this way be an acceptable measure of similarity between the two data points?
Say I have two vectors, vector A and vector B, that are in the same cluster. I have trained a clustering model, denoted by model, and model.computeCost() computes the squared distance between the input point and the corresponding cluster center.
(I am using Apache Spark MLlib)
val costA = model.computeCost(A)
val costB = model.computeCost(B)
val dissimilarity = math.abs(costA - costB)
Dissimilarity, i.e. the higher the value, the more unlike each other the two points are.
If you are just asking whether this is a valid metric, then the answer is "almost": it is a valid pseudometric, provided .computeCost is deterministic.
For simplicity I denote f(A) := model.computeCost(A) and d(A, B) := |f(A) - f(B)|.
Short proof: d is the L1 distance applied to the image of some function, and thus is a pseudometric itself; it is a metric if f is injective (which, in general, yours is not).
Long(er) proof:
d(A,B) >= 0: yes, since |f(A) - f(B)| >= 0.
d(A,B) = d(B,A): yes, since |f(A) - f(B)| = |f(B) - f(A)|.
d(A,B) = 0 iff A = B: no, and this is why it is only a pseudometric, since you can have many A != B with f(A) = f(B).
d(A,C) <= d(A,B) + d(B,C): yes, it follows directly from the triangle inequality for absolute values.
If you are asking whether it will work for your problem, then the answer is: it might, depending on the problem. There is no way to answer this without an analysis of your problem and data. As shown above, this is a valid pseudometric, so it measures something that behaves decently from a mathematical perspective. Whether it works for your particular case is a completely different story. The good thing is that most algorithms which work with metrics also work with pseudometrics. The only difference is that you effectively "glue together" points which have the same image (f(A) = f(B)); if that is not an issue for your problem, you can apply this kind of pseudometric in any metric-based reasoning without any problems. In practice, that means that if your f
computes the sum of squared distances between the input point and the corresponding cluster center
then it is actually the squared distance to the closest center (there is no summation involved when you consider a single point). This would mean that two points in two separate clusters are considered identical whenever they are equally far away from their own clusters' centers. Consequently, your measure captures "how different the points' relations to their respective clusters are". This is a well-defined, indirect dissimilarity computation; however, you have to be fully aware of what is happening before applying it (since it has specific consequences).
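As a small illustration of that last point, here is a sketch in R (the centers and points are made up) showing two points in different clusters, each exactly one unit from its own center, coming out as identical under this measure:
f <- function(p, centers) min(colSums((t(centers) - p)^2))  # squared distance to the nearest center
d <- function(a, b, centers) abs(f(a, centers) - f(b, centers))
centers <- rbind(c(0, 0), c(10, 0))
A <- c(1, 0)       # one unit from the first center
B <- c(9, 0)       # one unit from the second center, far away from A
d(A, B, centers)   # 0: A and B are "identical" under this pseudometric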
Your "cost" is actually the distance to the center.
Points that have the same distance to the center are considered to be identical (distance 0), which creates a really odd pseudonetric, because it ignores where on the circle of that distance points are.
It's not very likely this will work on your problem.
