Use of Chi-Square Test of Independence - statistics

One of my research hypotheses is that individuals from Southeast Asia who are ethnically Chinese are more likely to experience racially motivated hate crimes than their counterparts from other ethnic groups.
Respondents were recruited for my survey via non-probability sampling, and the data gathered for the hypothesis above are all nominal, with a sample size of 300, which means that the nonparametric chi-square test of independence is the most appropriate method of analysis.
However, there were 8 choices for ethnic group (reflecting the heterogeneity of ethnicities in Southeast Asia), including "Chinese". I expect a frequency of < 5 in some cells because few individuals from particular ethnic groups responded. Is it appropriate, or even possible, to combine the chi-square test of independence with Fisher's exact test (used only for the ethnic groups with an expected frequency of < 5)? If not, how else should I go about the analysis?
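
For reference, Fisher's exact test on a single 2x2 table can be sketched in plain Python as below. This is only an illustration, assuming the eight ethnic groups are collapsed into "Chinese" versus "all other groups" and the outcome is whether a hate crime was experienced. The function name `fisher_exact_2x2` and all counts are hypothetical, not survey data.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum over every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))


# Hypothetical counts (NOT from the survey): rows are Chinese / all other groups,
# columns are experienced a hate crime / did not.
p_value = fisher_exact_2x2(12, 28, 30, 230)
print(f"two-sided p = {p_value:.4f}")
```

The two-sided p-value here sums the probabilities of all tables that are no more likely than the observed one, which is one common convention; other conventions for the two-sided case exist.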

Related

2*2 repeated measures ANOVA with an unbalanced number of observations

Sorry for this very basic question. I've trawled through previous pages and cannot quite find a case that corresponds to our situation.
320 individuals rated two types of films. The rating was provided on a 1-11 scale. There are many films of each type. In short, the DV is a continuous variable.
20 individuals have a particular disease that we now consider of interest. We would like to examine the effect of the disease on the rating.
We conducted a 2-way repeated measures ANOVA in SPSS, using 'situation type' as a within-subject factor and 'disease status' as a between-subject factor. The design is obviously unbalanced, with more observations in the healthy group. The data appeared to be normally distributed, and Levene's test suggested equality of variances. Does that mean it is appropriate to use ANOVA for this analysis?

Weighted Average of Two Complier Average Treatment Effects

So I'm taking the weighted average of two complier average treatment effects (CATEs) for an assignment, but I'm not sure how to apportion the appropriate weights. Let me explain why I'm taking this average.
I am given data from a fictional randomized experiment testing the effects of get-out-the-vote efforts on turnout in urban and non-urban areas. Approximately half of the sample lives in urban areas and half in non-urban areas, but they were not randomly assigned to the treatment and control groups. That is, the treatment group is about 80% non-urban (the rest urban) and the control group is 80% urban (the rest non-urban). This creates a confounder because, everything else being equal, urbanites were less likely to vote than non-urbanites (at least in the fictional data).
I am being asked to estimate an overall complier average treatment effect (CATE) for get-out-the-vote interventions while accounting for this confounder. To do this, I found a separate CATE for the urban and non-urban parts of the sample, and I need to find an overall estimate from the two CATEs by taking a weighted average of them.
However, I'm not sure how to assign the appropriate weights. My professor has told us to assign more weight to the group that has more variation in the treatment. Since 80% of the treatment group is non-urban, should I assign a weight of .8 to the non-urban CATE and .2 to the urban one? (i.e., overall CATE = (.8)non-urban CATE + (.2)urban CATE)
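
One common reading of "more variation in the treatment" is to weight each stratum by its size times the variance of treatment assignment within it, N_s * p_s * (1 - p_s), where p_s is the share of that stratum assigned to treatment. The sketch below assumes this weighting rule, which may or may not be what the professor intends. The stratum sizes, treatment shares and CATE values are hypothetical placeholders, not values from GOTV_Experiment.csv.

```python
def treatment_variance_weights(sizes, treated_shares):
    """Weights proportional to N_s * p_s * (1 - p_s), normalized to sum to 1."""
    raw = [n * p * (1 - p) for n, p in zip(sizes, treated_shares)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_cate(cates, weights):
    """Weighted average of stratum-level CATEs."""
    return sum(c * w for c, w in zip(cates, weights))


# Hypothetical inputs, order: [urban, non-urban].
sizes = [500, 500]            # assumed stratum sizes
treated_shares = [0.2, 0.8]   # assumed share of each stratum assigned to treatment
cates = [0.10, 0.06]          # placeholder stratum-level CATEs

weights = treatment_variance_weights(sizes, treated_shares)
overall = weighted_cate(cates, weights)
print(weights, overall)
```

Under these assumed inputs the two strata receive equal weight (up to floating-point rounding), because p(1 - p) is the same for p = 0.2 and p = 0.8. That differs from using the 0.8 / 0.2 composition of the treatment group directly as the weights.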
For background, the data can be found here: https://press.princeton.edu/student-resources/thinking-clearly-with-data. It's the "GOTV_Experiment.csv" data. Thanks in advance for your help!

Assumptions of the Chi-square test of independence

I want to use the chi-square test of independence to test the following two variables: student knowledge vs. course attendance.
The null hypothesis is: student knowledge and course attendance (X and Y) are independent
Members in each student knowledge group: low (12), average (29), high (9)
The results show two degrees of freedom, a chi-square statistic of 0.20, and a p-value of 0.90, so we cannot reject the null hypothesis. I added an image of my test.
I have some doubts regarding the following two issues: the student knowledge groups have unequal numbers of participants, and the number of participating students in each course is fewer than 10.
My question is: is this test suitable for my data?
If this test cannot be used for my data, what statistical test should I use instead?
Welcome to Stack Exchange. Using the chi-square test of independence can be an issue with small cell sizes (e.g., G3, course Y, which has a cell count of 2). This is because the test statistic only approximately follows a chi-square distribution, and the approximation becomes unreliable when cell counts are small.
I would recommend Fisher's Exact Test. It's usually designated as a tool for small sample sizes, but it is still effective for large samples.
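
The small-cell concern can be checked by computing the expected counts under independence and flagging any that fall below 5, a common rule of thumb. The sketch below does this in plain Python. The 3x2 table is hypothetical: the row totals (12, 29, 9) match the group sizes in the question, but the individual cell counts are invented for illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# Hypothetical 3x2 table: rows are the low / average / high knowledge groups
# (row totals 12, 29, 9 as in the question); the cell counts are invented.
observed = [[7, 5], [17, 12], [7, 2]]

expected = expected_counts(observed)
small_cells = [
    (i, j)
    for i, row in enumerate(expected)
    for j, value in enumerate(row)
    if value < 5
]
print("expected counts:", expected)
print("cells with expected count < 5:", small_cells)
print("chi-square statistic:", chi_square_statistic(observed))
```

If any cells are flagged, the chi-square p-value should be treated with caution, which is the situation where an exact test is usually preferred.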

Violation of PH assumption

In a survival analysis, assume that the p-value for a variable is statistically significant, say with a positive association with the outcome. However, according to the Schoenfeld residuals, the proportional hazards (PH) assumption is violated.
Which of the scenarios below could happen after correcting for the PH violation?
1) The p-value may not be significant anymore.
2) The p-value is still significant, but the size of the HR may change.
3) The p-value is still significant, but the direction of the association may be altered (i.e., a positive association may end up being negative).
A PH assumption violation usually means that there is an interaction effect that needs to be included in the model. In linear regression, including a new variable may alter the direction of the existing variables' coefficients due to collinearity. Can we use the same rationale in the case above?
Therneau and Grambsch have written a very useful text, "Modeling Survival Data", that has an entire chapter on testing proportionality. At the end of the chapter is a section on causes and modeling alternatives, which I think can be used to answer this question. Because you mention interactions, your question about one particular p-value is rather ambiguous.
1) Certainly, if you have chosen a particular measurement as the subject of your interest and it turns out that all of its effect is due to its interaction with another variable that you happened to also measure, then the estimated effect of the variable of interest may shrink, possibly to zero, and its p-value may no longer be significant.
2) It's almost certain that changing to a model with a different structure (say, with the addition of time-varying covariates or a different treatment of time) will result in a different estimated HR for a particular covariate, and I think it would be impossible to predict the direction of the change.
3) As to whether the sign of the coefficient could change, I'm quite sure that would be possible as well. The scenario I'm thinking of would have a mixture of two groups, say men and women, where one of the groups had a subgroup whose early mortality was greatly increased (e.g. by breast cancer), while the surviving members of that group would have a more favorable survival expectation. The base model might show a positive coefficient (higher risk), while a model capable of identifying the subgroup at risk would then allow the gender-related coefficient to become negative (lower risk).
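
One way to make the second and third scenarios concrete is a Cox model in which the coefficient of the covariate is allowed to vary with time. The notation below is an assumption for illustration (a single covariate x and a chosen function g(t), such as log t), not something taken from the text cited above.

```latex
h(t \mid x) = h_0(t)\,\exp\{\beta x + \gamma\, x\, g(t)\},
\qquad
\mathrm{HR}(t) = \exp\{\beta + \gamma\, g(t)\}.
```

Under proportional hazards, \gamma = 0 and the hazard ratio is the constant \exp(\beta). If \gamma \neq 0, the hazard ratio changes with time, and when \beta and \gamma g(t) have opposite signs it can cross 1, so a single time-averaged coefficient can differ in size, or even in sign, from the effect in a given time window.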

Statistical method to compare urban vs rural matched siblings

I am writing a study protocol for my master's thesis. The study seeks to compare the rates of non-communicable diseases (NCDs) and their risk factors, and to determine the effects of rural-to-urban migration. Sibling pairs will be identified from a rural area. One of the siblings should have participated in the rural NCD survey which is currently ongoing in the area. The other sibling should have left the area and reported moving to a city. Data will be collected by completing a questionnaire on demographics, family history, medical history, diet, alcohol consumption, smoking, and physical activity. This will be done for both the rural and urban sibling, with data on the amount of time spent in urban areas fur
The outcomes, which are binary (whether one has a condition or not), are: 1. diabetic, 2. hypertensive, 3. obese
What statistical method can I use to compare the outcomes (stated above) between the two groups, considering that the siblings were matched (one urban sibling for every rural sibling)?
What statistical methods can also be used to explore associations between the amount of time spent in urban residence and the outcomes?
Given that your main aim is to compare the frequencies of two nominal distributions, a chi-square test seems to be the method of choice for your first question. However, it should be mentioned that the chi-square test is a fairly basic test of differences between samples. If you are studying medicine (or a related field), a chi-square test is fine because it is frequently applied by researchers in that field. If you are studying psychology or sociology (or a related field), I'd advise highlighting the limitations of the test in the discussion section, since it only tests your observed distributions against those expected by chance.
Regarding your second question, logistic regression would be applicable since it allows binary variables both as independent variables (predictors) and as the dependent variable. However, if you have other interval-scaled variables (e.g. age, weight), you could also use t-tests or ANOVAs to investigate differences in these variables with respect to the presence of specific diseases (i.e. is diabetic or not).
Overall, this strongly depends on what you mean by "association". Classically, "association" refers to correlations or linear regression (for which you need interval-scaled variables on "both sides"), but given your data structure, the aforementioned methods are a better fit.
How you actually calculate these tests depends on the statistics software used.
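
Because the question stresses that each urban sibling is matched to one rural sibling, a paired counterpart of the chi-square test is McNemar's test, which uses only the discordant pairs. Note that this is a swapped-in technique not named in the answer above. The sketch below computes the exact (binomial) version in plain Python; the function name `mcnemar_exact` and the pair counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null, each discordant pair is equally likely to go either way,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical sibling pairs for one outcome (e.g. hypertension):
# b = pairs where only the urban sibling has the condition,
# c = pairs where only the rural sibling has the condition.
p_value = mcnemar_exact(b=14, c=5)
print(f"exact McNemar p = {p_value:.4f}")
```

Pairs in which both siblings or neither sibling has the condition do not enter the calculation, which is how the test accounts for the matching.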

Resources