Important measures to assess the relationship between two variables - statistics

What other measures are important for assessing the relationship between two variables apart from correlation, determination and significance?

Related

Can I include covariates outside of the minimally sufficient set in a causal framework that aren't in the causal pathway?

I am applying a causal method to a cohort study analysis on pollutant exposure and disease X. Based on our understanding of the disease, we believe that aging is the only confounder.
From what I understand, age would be the item in our minimally sufficient set required to evaluate the outcome/exposure relationship.
Assuming all other causal assumptions are met, does the minimally sufficient set represent the only variable that should be included in the model outside of the exposure?
Could I still include covariates like smoking history and gender that effect the outcome versus age which effects the outcome and the exposure?
Please help! I can’t seem to find anything conclusive online. I want to include the other covariates because I feel their effect sizes contextualize the effect of the exposure.
Yes, you can add additional variables to your analysis. They will either be good, neutral, or bad, depending on the causal structure of your problem.
I strongly recommend the paper A Crash Course in Good and Bad Controls by Cinelli, Forney, and Pearl, for a comprehensive classification of possible cases.
Your description of gender and smoking status affecting only the outcome seems to comply with model 8 in the paper. These are, in general, good variables to add, since they will help explaining the variance of the outcome, therefore reducing the variance left for the treatment to explain - practically increasing the precision of the treatment effect estimation.

Testing significance of Strauss parameters in mppm model

I have a follow up question from my previous post.
Upon creating mppm models like these:
Str <- hyperframe(str=with(simba, Strauss(mean(nndist(Points)))))
fit0 <- mppm(Points ~ group, simba)
fit1 <- mppm(Points ~ group, simba, interaction=Str,
iformula = ~str + str:id)
Using anova.mppm to run a likelihood ratio test shows that the interaction is highly significant as a whole, but I would also like to test:
whether each individual id shows significant regularity.
whether some groups of ids show significantly stronger inhibition than other groups, for example, whether ids 1-7 are are significantly more regular than ids 8-10.
perform pairwise comparisons of regularity between different ids.
I am aware I could build separate ppm models for each id to test for significant regularity in each id, but I am not sure this is the best approach. Also, I do not think the "summary output" with the p-values for each Strauss interaction parameter can be used for pairwise comparisons other than to the reference level.
Any advice is greatly appreciated.
Thank you!
First let me explain that, for Gibbs models, the likelihood is intractable, so anova.mppm performs the adjusted composite likelihood ratio test, not the likelihood ratio test. However, you can essentially treat this as if it were the likelihood ratio test based on deviance differences.
whether each individual id shows significant regularity
I am aware I could build separate ppm models for each id to test for significant regularity in each id, but I am not sure this is the best approach.
This is appropriate. Use ppm to fit a Strauss model to an individual point pattern, and use anova.ppm to test whether the Strauss interaction is statistically significant.
whether some groups of ids show significantly stronger inhibition than other groups, for example, whether ids 1-7 are are significantly more regular than ids 8-10.
Introduce a new categorical variable (factor) f, say, that separates the two groups that you want to compare. In your model, add the term f:str to the interaction formula; this gives you the alternative hypothesis. The null and alternative models are identical except that the alternative includes the term f:str in the interaction formula. Now apply anova.mppm. Like all analyses of variance, this performs a two-sided test. For the one-sided test, inspect the sign of the coefficient of f:str in the fitted alternative model. If it has the sign that you wanted, report it as significant at the same p-value. Otherwise, report it as non-significant.
perform pairwise comparisons of regularity between different ids.
This is not yet supported (in theory or in software).
[Congratulations, you have reached the boundary of existing methodology!]

Does fitting a linear regression and performing t test will give similar results?

I am trying to predict the statistically significant variables out of a list of binary variables. I am having a conceptual doubt in the below mentioned 2 approaches to find the relevant variables.
Dependent variable:
Height of a person
Independent variables:
Gender(Male or Female)
Financial_Status(Below Poverty Line or not)
College_Graduate(Yes or No)
Approach 1: Fitting a linear regression while taking these as dependent/independent variables and finding the statistically significant variables
Approach 2: Performing an individual statistical test for each dependent variable(t-test or some other relevant test) to compute the statistically significant variables
Are both of these approaches similar and will give similar results? If not, what's the exact difference?
Since you have multiple independent variables, than clearly no.
If you would like to go for the ttest approach for each of the values of the different independent variables (Gender, Financial_Status and College_Graduate) then it means you'll perform 3 different tests. Performing multiple tests is something that is risky in terms of false positive results, and thus should be adjusted with a multiple comparison adjustment method (Bonferoni, FDR, among others).
On the other hand, if you'll use a single multiavariate linear regression you wouldn't have the correct for multiple comparisons, which is why, in my opinion, is the better approach.

Violation of PH assumption

Running a survival analysis, assume the p-value regarding a variable is statistically significant - let's say with a positive association with the outcome. However, according to the Schoenfeld residuals, the proportional hazard (PH) assumption has is violated.
Which scenario among below could possibly happen after correcting for PH violations?
The p-value may not be significant anymore.
p-value still significant, but the size of HR may change.
p-value still significant, but the direction of association may be altered (i. e. a positive association may end up being negative).
The PH assumption violation usually means that there is an interaction effect that needs to be included in the model. In the simple linear regression, including a new variable may alter the direction of the existing variables' coefficients due to the collinearity. Can we use the same rationale in the case above?
Therneau and Gramsch have written a very useful text, "Modeling Survival Data" that has an entire chapter on testing proportionality. At the end of the chapter is a section on causes and modeling alternatives, which I think can be used for answering this question. Since you mention interactions it makes your question about a particular p-value rather ambiguous and vague.
1) Certainly if you have chosen a particular measurement as the subject of your interest and it turns out the all of the effects are due to its interaction with another variable that you happened to also measure, then you may be in a position where the variable-of-interest's p-value will decrease, possibly to zero.
2) It's almost certain that modification of a model with a different structure (say will the addition of time-varying covariates or a different treatment of time) will result in a different estimated HR for a particular covariate and I think it would be impossible to predict the direction of the change.
3) As to whether to sign of the coefficient could change, I'm quite sure that would be possible as well. The scenario I'm thinking of would have a mixture of two groups say men and women and one of the groups had a sub-group whose early mortality was greatly increased, e.g. breast cancer, while the surviving members of that group would have a more favorable survival expectation. The base model might show a positive coefficient (high risk) while a model that was capable of identifying the subgroup at risk would then allow the gender-related coefficient to become negative (lower risk).

Statistical method to compare urban vs rural matched siblings

I am writing a study protocol for my masters thesis. The study seeks to compare the rates of Non Communicable Diseases and risk factors and determine the effects of rural to urban migration. Sibling pairs will be identified from a rural area. One of the siblings should have participated in the rural NCD survey which is currently on going in the area. The other sibling should have left the area and reported moving to a city.Data will collected by completing a questionnaire on demographics, family history,medical history, diet,alcohol consumption, smoking ,physical activity.This will be done for both the rural and urban sibling, with data on the amount of time spent in urban areas fur
The outcomes which are binary (whether one has a condition or not) are : 1.diabetic, 2.hypertensive, 3.obese
What statistical method can I use to compare the outcomes (stated above) between the two groups, considering that the siblings were matched (one urban sibling for every rural sibling)?
What statistical methods can also be used to explore associations between amount spent in urban residence and the outcomes?
Given that your main aim is to compare quantities of two nominal distributions, a chi-square test seems to be the method of choice with regard to your first question. However, it should be mentioned that a chi-square test is somehow "the smallest" test for answering differences in samples. If you are studying medicine (or related) a chi-square test is fine because it is also frequently applied by researchers of this field. If you are studying psychology or sociology (or related) I'd advise to highlight limitations of the test in the discussions section since it mostly tests your distributions against randomly expected distributions.
Regarding your second question, a logistic regression would be applicable since it allows binomial distributed variables both for independent variables (predictors) and dependent variables. However, if you have other interval scaled variables (e.g. age, weight etc.) you could also use t-tests or ANOVAs to investigate differences between these variables with respect to the existence of specific diseases (i.e. is diabetic or not).
Overall, this matter strongly depends on what you mean by "association". Classically, "association" refers to correlations or linear regression (for which you need interval scaled variables on "both sides") but given your data structure, the aforementioned methods possess a better fit.
How you actually calculate these tests depends on the statistics software used.

Resources