Will fitting a linear regression and performing t-tests give similar results? - statistics

I am trying to identify the statistically significant variables among a list of binary variables. I have a conceptual doubt about the two approaches below for finding the relevant variables.
Dependent variable:
Height of a person
Independent variables:
Gender (Male or Female)
Financial_Status (Below Poverty Line or not)
College_Graduate (Yes or No)
Approach 1: Fit a linear regression with these as the dependent/independent variables and find the statistically significant variables.
Approach 2: Perform an individual statistical test (a t-test or some other relevant test) for each independent variable to find the statistically significant variables.
Are these two approaches equivalent, and will they give similar results? If not, what exactly is the difference?

Since you have multiple independent variables, the answer is clearly no.
If you go for the t-test approach for each of the independent variables (Gender, Financial_Status and College_Graduate), then you will perform 3 different tests. Performing multiple tests is risky in terms of false positive results, so the results should be adjusted with a multiple-comparison adjustment method (Bonferroni, FDR, among others).
On the other hand, if you use a single multiple linear regression, you won't have to correct for multiple comparisons, which is why, in my opinion, it is the better approach.
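As a minimal sketch of the difference in R (all names here are hypothetical, assuming a data frame d with a numeric height and the three binary factors):

# Approach 1: one multiple regression; each coefficient's p-value
# is conditional on the other predictors being in the model.
fit <- lm(height ~ gender + financial_status + college_graduate, data = d)
summary(fit)

# Approach 2: three separate marginal t-tests, followed by a
# multiple-comparison correction of the resulting p-values.
pvals <- c(gender  = t.test(height ~ gender, data = d)$p.value,
           finance = t.test(height ~ financial_status, data = d)$p.value,
           college = t.test(height ~ college_graduate, data = d)$p.value)
p.adjust(pvals, method = "bonferroni")   # or method = "fdr"

The two sets of p-values will generally differ, because each regression coefficient is adjusted for the other predictors while each t-test is marginal; they coincide only in special cases, such as a single predictor.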

Related

How do I analyze the change in the relationship between two variables?

I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine whether that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology, because Google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to determine the correlation coefficient and plot a linear regression. I thought this may be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I think this is called the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line/function says it should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into google should provide you with a large number of references so I'll stick to more concise descriptions.
(1) Structural break modelling. As the name suggests, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, a change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest: you would estimate the squared error (or some other measure of fit) on the full sample and on the two sub-samples (before and after the break). If the gains in fit are large when dividing the sample, then you would favour the model with the break and use different coefficients before and after the structural change.
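A hedged sketch of that comparison in R, assuming a data frame dat with columns x and y ordered in time and a candidate break after observation k (both hypothetical):

sse <- function(d) sum(resid(lm(y ~ x, data = d))^2)  # squared-error fit measure
k <- 50                                   # hypothetical break point
n <- nrow(dat); p <- 2                    # observations; parameters per regression
sse_full  <- sse(dat)                     # one line for the whole sample
sse_split <- sse(dat[1:k, ]) + sse(dat[-(1:k), ])  # separate lines before/after
# Chow-type F statistic: a large value favours the model with a break
F_stat <- ((sse_full - sse_split) / p) / (sse_split / (n - 2 * p))
pf(F_stat, p, n - 2 * p, lower.tail = FALSE)       # approximate p-value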
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
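For a quick look at slow drift, a crude stand-in for a full state-space model is a rolling-window regression; a sketch, reusing the hypothetical dat from above:

w <- 20                                   # hypothetical window width
starts <- seq_len(nrow(dat) - w + 1)
slopes <- sapply(starts, function(i)
  coef(lm(y ~ x, data = dat[i:(i + w - 1), ]))["x"])
plot(starts, slopes, type = "l",
     xlab = "window start", ylab = "rolling slope")  # drift suggests change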
I hope this helps!

Testing significance of Strauss parameters in mppm model

I have a follow-up question to my previous post.
Upon creating mppm models like these:
# Strauss interaction, with radius set to the mean nearest-neighbour distance
Str <- hyperframe(str = with(simba, Strauss(mean(nndist(Points)))))
# Null model: no interpoint interaction
fit0 <- mppm(Points ~ group, simba)
# Alternative: Strauss interaction, with a separate coefficient for each id
fit1 <- mppm(Points ~ group, simba, interaction = Str,
             iformula = ~ str + str:id)
Using anova.mppm to run a likelihood ratio test shows that the interaction is highly significant as a whole, but I would also like to test:
whether each individual id shows significant regularity.
whether some groups of ids show significantly stronger inhibition than other groups, for example, whether ids 1-7 are significantly more regular than ids 8-10.
perform pairwise comparisons of regularity between different ids.
I am aware I could build separate ppm models for each id to test for significant regularity in each id, but I am not sure this is the best approach. Also, I do not think the "summary output" with the p-values for each Strauss interaction parameter can be used for pairwise comparisons other than to the reference level.
Any advice is greatly appreciated.
Thank you!
First let me explain that, for Gibbs models, the likelihood is intractable, so anova.mppm performs the adjusted composite likelihood ratio test, not the likelihood ratio test. However, you can essentially treat this as if it were the likelihood ratio test based on deviance differences.
whether each individual id shows significant regularity
I am aware I could build separate ppm models for each id to test for significant regularity in each id, but I am not sure this is the best approach.
This is appropriate. Use ppm to fit a Strauss model to an individual point pattern, and use anova.ppm to test whether the Strauss interaction is statistically significant.
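A sketch for a single pattern, assuming X is one of the point patterns in simba$Points and reusing the same Strauss radius as above:

X <- simba$Points[[1]]                  # one id's point pattern
r <- mean(nndist(X))                    # hypothetical interaction radius
fit_pois <- ppm(X ~ 1)                  # null: Poisson, no interaction
fit_str  <- ppm(X ~ 1, Strauss(r))      # alternative: Strauss interaction
anova(fit_pois, fit_str, test = "Chi")  # dispatches to anova.ppm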
whether some groups of ids show significantly stronger inhibition than other groups, for example, whether ids 1-7 are significantly more regular than ids 8-10.
Introduce a new categorical variable (factor) f, say, that separates the two groups that you want to compare. In your model, add the term f:str to the interaction formula; this gives you the alternative hypothesis. The null and alternative models are identical except that the alternative includes the term f:str in the interaction formula. Now apply anova.mppm. Like all analyses of variance, this performs a two-sided test. For the one-sided test, inspect the sign of the coefficient of f:str in the fitted alternative model. If it has the sign that you wanted, report it as significant at the same p-value. Otherwise, report it as non-significant.
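A sketch of this comparison, reusing the hyperframe and Str from the question, with a hypothetical grouping of ids 1-7 versus 8-10:

simba$f <- factor(rep(c("a", "b"), c(7, 3)))  # hypothetical grouping factor
fit_null <- mppm(Points ~ group, simba, interaction = Str,
                 iformula = ~ str)
fit_alt  <- mppm(Points ~ group, simba, interaction = Str,
                 iformula = ~ str + f:str)
anova(fit_null, fit_alt, test = "Chi")        # two-sided test
coef(fit_alt)                                 # inspect the sign of f:str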
perform pairwise comparisons of regularity between different ids.
This is not yet supported (in theory or in software).
[Congratulations, you have reached the boundary of existing methodology!]

How do I deal with 3 categorical IVs, one continuous IV and 4 ordinal DVs?

So, I am writing up a paper for a study I conducted, and I am finding the statistics to be more difficult than I thought. Essentially, I have:
three categorical independent variables, each of which is binary,
one categorical variable on a 4-point scale, and
four ordinal DVs, each of which has three levels.
Ideally, what I would want is to be able to run an ANOVA of some sort, but the data do not meet parametric assumptions. I do not know of any test that would allow me to test all of my IVs and DVs in a single model (the assumptions of linearity, normality, and parametric data are not met). Ideally, I'd also like to run a test where multiple comparisons wouldn't be an issue.
I know that is a lot, so please forgive the long windedness of the question!
Thank you so much for all your help!

Under what conditions can two classes have different average values, yet be indistinguishable to an SVM?

I am asking because I have sometimes observed in neuroimaging that a brain region shows different average activation between two experimental conditions, yet an SVM classifier somehow can't distinguish the patterns of activation between the two conditions.
My intuition is that this might happen in cases where the within-class variance is far greater than the between-class variance. For example, suppose we have two classes, A and B, and that for simplicity our data consists just of integers (rather than vectors). Let the data falling under class A be 0,0,0,0,0,10,10,10,10,10. Let the data falling under class B be 1,1,1,1,1,11,11,11,11,11. Here, A and B are clearly different on average, yet there's no decision boundary that would allow A and B to be distinguished. I believe this logic would hold even if our data consisted of vectors, rather than integers.
Is this a special case of some broader range of cases where an SVM would fail to distinguish two classes that are different on average? Is it possible to delineate the precise conditions under which an SVM classifier would fail to distinguish two classes that differ on average?
EDIT: Assume a linear SVM.
As described in the comments: there are no such conditions, because an SVM will separate the data just fine (I am not talking about generalisation here, just separating the training data). For the rest of the answer I assume there are no two identical points with different labels.
Non-linear case
In the kernel case, using something like the RBF kernel, an SVM will always perfectly separate any training set, given a big enough C.
Linear case
If the data are linearly separable then, again, a big enough C will separate them just fine. If the data are not linearly separable, cranking up C as much as possible will lead to a smaller and smaller training error (it will of course not reach 0, since the data are not linearly separable).
In particular, for the data you provided, a kernelized SVM will reach 100% training accuracy, whereas no linear model can: the best any single threshold can do here is 15 of the 20 points (75%, e.g. a cut just below 1 or just above 10, each of which misclassifies one cluster of five), and a cut "in the middle", around 5, gets only 50%. This has nothing to do with the means being different or the relation between the variances; it is simply a dataset that no linear separator can classify perfectly, so the failure is not specific to SVMs.
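A quick check in R with the e1071 package (a hypothetical choice; any SVM implementation would do) on the toy data from the question:

library(e1071)
d <- data.frame(x = c(0, 0, 0, 0, 0, 10, 10, 10, 10, 10,
                      1, 1, 1, 1, 1, 11, 11, 11, 11, 11),
                y = factor(rep(c("A", "B"), each = 10)))

fit_lin <- svm(y ~ x, data = d, kernel = "linear", cost = 1e3, scale = FALSE)
mean(predict(fit_lin, d) == d$y)   # well below 1: the clusters interleave

fit_rbf <- svm(y ~ x, data = d, kernel = "radial", gamma = 1,
               cost = 1e3, scale = FALSE)
mean(predict(fit_rbf, d) == d$y)   # 1: the RBF kernel separates perfectly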

Bayesian t-test assumptions

Good afternoon,
I know that the traditional independent t-test assumes homoscedasticity (i.e., equal variances across groups) and normality of the residuals.
These are usually checked using Levene's test for homogeneity of variances, and the Shapiro-Wilk test and QQ plots for the normality assumption.
Which statistical assumptions do I have to check for the Bayesian independent t-test? How may I check them in R with coda and rjags?
For whichever test you want to run, find the formula and plug in the posterior draws of the parameters you have, such as the variance parameter and any regression coefficients the formula requires. Iterating the formula over the posterior draws gives you a range of values for the test statistic, from which you can take the mean for a point estimate and the sd for a standard deviation (an uncertainty estimate).
And boom, you're done.
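A sketch with coda, assuming post is an mcmc.list from rjags::coda.samples with monitored group standard deviations sigma1 and sigma2 (names hypothetical):

draws <- as.matrix(post)                         # stack the chains into one matrix
ratio <- draws[, "sigma1"]^2 / draws[, "sigma2"]^2  # variance ratio per draw
mean(ratio); sd(ratio)                           # point estimate and uncertainty
quantile(ratio, c(0.025, 0.975))                 # does the interval cover 1?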
There might be non-parametric Bayesian t-tests, but commonly Bayesian t-tests are parametric, and as such they assume equality of the relevant population variances. If you can obtain a t-value from a regular t-test (of whatever type, from any software package you're comfortable with) and have used Levene's test (don't regard it as a fully dependable check; remember it relies on a p-value), then you can do a Bayesian t-test. But remember that a Bayesian t-test requires a conventional model for the observations (the likelihood) and an appropriate prior for the parameter of interest.
It is highly recommended that t-tests be re-parameterized in terms of effect sizes (especially standardized mean-difference effect sizes). That is, you focus on the Bayesian estimation of the effect size arising from the t-test rather than on the other parameters in the t-test. If you opt to estimate the effect size from a t-test, then a very easy-to-use, free, online Bayesian t-test software is THIS ONE HERE (probably one of the most user-friendly packages available; note that this software uses a Cauchy prior for the effect size arising from any type of t-test).
Finally, since you want to do a Bayesian t-test, I would suggest focusing your attention on picking an appropriate/defensible/meaningful prior rather than on Levene's test. No test can really show whether the sample data came from two populations with equal variances unless the data are plentiful. Note that whether the sample data came from populations with equal variances is itself an inferential (Bayesian or non-Bayesian) question.
