So I'm taking the weighted average of two complier average treatment effects (CATEs) for an assignment, but I'm not sure how to apportion the appropriate weights. Let me explain why I'm taking this average.
I am given data from a fictional randomized experiment testing the effects of get-out-the-vote efforts on turnout in urban and non-urban areas. Approximately half of the sample lives in urban areas and half in non-urban areas, but they were not randomly assigned to the treatment and control groups. That is, the treatment group is about 80% non-urban (the rest urban) and the control group is 80% urban (the rest non-urban). This creates a confounder because, everything else being equal, urbanites were less likely to vote than non-urbanites (at least in the fictional data).
I am being asked to estimate an overall complier average treatment effect (CATE) for get-out-the-vote interventions while accounting for this confounder. To do this, I found a separate CATE for the urban and non-urban parts of the sample, and I need to find an overall estimate from the two CATEs by taking a weighted average of them.
However, I'm not sure how to assign the appropriate weights. My professor has told us to assign more weight to the group that has more variation in the treatment. Since 80% of the treatment group is non-urban, should I assign a weight of .8 to the non-urban CATE and .2 to the urban one? (i.e., overall CATE = (.8)non-urban CATE + (.2)urban CATE)
For background, the data can be found here: https://press.princeton.edu/student-resources/thinking-clearly-with-data. It's the "GOTV_Experiment.csv" data. Thanks in advance for your help!
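One possible reading of the professor's hint, sketched below, is to weight each stratum by its size times the variance of the treatment indicator within it, N_g * p_g * (1 - p_g), where p_g is the share of that stratum that was treated. This is only an illustration under stated assumptions: the arm sizes (500 treated, 500 control) and the two CATE values are hypothetical, and only the 80/20 splits come from the question.

```python
def variance_weights(strata):
    """Normalized weights proportional to N_g * p_g * (1 - p_g).

    strata maps a stratum name to (n_treated, n_control).
    """
    raw = {}
    for name, (n_treated, n_control) in strata.items():
        n = n_treated + n_control
        p = n_treated / n  # share of the stratum that was treated
        raw[name] = n * p * (1 - p)
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# Assumed: equal-sized arms of 500; treatment arm 80% non-urban,
# control arm 80% urban (splits taken from the question).
strata = {"non_urban": (400, 100), "urban": (100, 400)}
weights = variance_weights(strata)

# Hypothetical stratum-level CATEs, for illustration only.
cates = {"non_urban": 0.10, "urban": 0.06}
overall = sum(weights[g] * cates[g] for g in cates)

print(weights)
print(overall)
```

Under these assumptions both strata have the same treatment variance (0.8 * 0.2 in each), so this particular weighting scheme does not reproduce a .8/.2 split. Whether this is the scheme the professor has in mind is not something the question settles.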
Sorry for this very basic question. I've trawled through previous pages and cannot quite find a case that corresponds to our situation.
320 individuals rated two types of films. The rating was provided on a 1-11 scale. There are many films of each type. In short, the DV is a continuous variable.
20 individuals have a particular disease that we now consider of interest. We would like to examine the effect of the disease on the rating.
We conducted a 2-way repeated measures ANOVA, using 'situation type' as a within-subject factor and 'disease status' as a between-subject factor, in SPSS. The design is obviously unbalanced, with more observations in the healthy group. The data appeared to be normally distributed. Levene's test suggested equality of variance. Does that mean it is appropriate to use ANOVA for this analysis?
I have a set of patients and their actuarial 1- and 5-year survival. I have also run their data through a certain commonly utilised medical score that calculates survival probability at 1 and 5 years (for example 75% and 55% respectively). I'd like to compare both survival rates.
I calculated the mean predicted survival probability for all patients at 1 and 5 years as the mean of the predicted survival probabilities. I then calculated the mean actuarial survival by using 100% if alive at 1 year and 0% if dead at 5 years. I then compared the means of both groups with a t-test.
I have a feeling that what I am doing is grossly incorrect and goes against all rules of statistics; however, I have not found any solution to my problem anywhere. Maybe someone can help me? R packages and code are welcome.
I have a requirement where I have to verify that the transmit power out of a device, as measured at its connector, is within 2 dB of its expected value over 95% of test measurements.
I am using a signal analyzer to analyze the transmitted power. I only get the average power value, min, max and stdDev of the measurements and not the individual power measurements.
Now, the question is how I would verify the "95% thing" using the average power, min, max and stdDev. It seems that I can use a normal distribution to find the 95% confidence level.
I would appreciate it if someone could help me with this.
Thanks in anticipation
The way I'm reading this, it seems you are a statistical beginner, so if I'm wrong there, the rest of this answer will probably be insultingly basic, and I'm sorry.
Anyway, the idea is that if a dataset is normally distributed, and all the observations are independent of one another, then 95% of the data points will fall within 1.96 standard deviations of the mean.
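As a rough sketch of how that rule could be applied to the analyzer's summary numbers, the snippet below estimates the fraction of individual readings expected to land within the tolerance band. It assumes the individual readings (in dB) are normally distributed and independent, which the analyzer output alone cannot confirm. The function name and the example values (mean 20.3 dB, stdDev 0.8 dB, expected 20.0 dB) are made up for illustration.

```python
from statistics import NormalDist


def fraction_within_tolerance(mean_db, sd_db, expected_db, tol_db=2.0):
    """Fraction of readings expected within +/- tol_db of expected_db,
    assuming readings follow a normal distribution N(mean_db, sd_db)."""
    dist = NormalDist(mu=mean_db, sigma=sd_db)
    return dist.cdf(expected_db + tol_db) - dist.cdf(expected_db - tol_db)


# Hypothetical summary values reported by the signal analyzer.
frac = fraction_within_tolerance(mean_db=20.3, sd_db=0.8, expected_db=20.0)
print(f"estimated fraction within 2 dB: {frac:.3f}")
print("meets 95% requirement:", frac >= 0.95)
```

The reported min and max give a useful sanity check: if both already sit inside the tolerance band, every reading in that run passed regardless of the distributional assumption.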
Do you get identical estimates of average power every time you measure, or are there some slight random differences from reading to reading? My guess is that it's the second. If you were to measure the power a whole bunch of times, and each time you plotted your average power value on a histogram, then that histogram of sample means would have the shape of a bell curve. This bell curve of sample means would have its own mean and standard deviation, and if you have thousands or millions of data points going into the calculation of each average power reading, it's not horrible to assume that it is a normal distribution. The explanation for this phenomenon is known as the 'central limit theorem', and I recommend both Khan Academy's presentation of it and the Wikipedia page on it.
On the other hand, if your average power is the mean of some small number of data points, for instance n = 5 or n = 30, then the assumption of a normal distribution of sample means can be pretty bad. In this case, your 95% confidence interval around the average power goes from qt(0.975,n-1)*SD/sqrt(n) below the average to qt(0.975,n-1)*SD/sqrt(n) above the average, where qt(0.975,n-1) is the 97.5th percentile of the t distribution with n-1 degrees of freedom, and SD is your measured standard deviation.
I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who were surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result, SPSS is giving very, very low standard error estimates and very, very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in Statistics is treated as a frequency weight.
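To make the rescaling step from the question concrete, here is a small sketch of dividing each weight by the mean weight. It shows that the rescaled weights sum to the actual sample size while the weighted mean is unchanged. The weights and values are made-up numbers, and the helper names are just for illustration. As noted above, this only fixes the apparent sample size; it does not by itself give design-based standard errors.

```python
def normalize_weights(weights):
    """Rescale weights so they average 1 (and so sum to the sample size)."""
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]


def weighted_mean(values, weights):
    """Weighted mean; unaffected by multiplying all weights by a constant."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical survey data: four respondents with population weights.
raw_weights = [28000, 12000, 35000, 5000]
values = [3.0, 4.0, 2.0, 5.0]

scaled_weights = normalize_weights(raw_weights)

print(sum(raw_weights), sum(scaled_weights))
print(weighted_mean(values, raw_weights), weighted_mean(values, scaled_weights))
```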
I did a PCA/FA analysis with and without standardization and ended up with different results. For standardization, I just divided each input variable by its corresponding standard deviation. However, I have not subtracted the mean (as in the case of z-scores). My question is: how important is it to subtract the mean in the case of PCA/FA?
I found on another blog that dividing by the standard deviation is another way of standardizing the data set. Is this superior to z-scores in any sense? Thanks.
By definition, principal components try to capture the highest variation in the data. The important point is that variation here is defined in terms of the 2-norm, not the variance and not the standard deviation.
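For example, the first principal component is the linear combination of the data in the direction given by $w_1 = \arg\max_{\|w\|_2 = 1} \|Xw\|_2$, where $X$ is the data matrix with one column per variable.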
This matters a lot because:
unlike the variance, the 2-norm is sensitive to location; in other words, if you add a constant to a vector, the variance does not change but the 2-norm does;
the 2-norm is also sensitive to scale; i.e. if a vector is multiplied by a constant factor, its 2-norm scales by that factor, so a variable's share of the total variation depends on the units it is measured in.
There are at least two problems if an analysis is affected by the location and scale of the explanatory factors:
In reality, variables represent different phenomena, so they have different and incomparable scales and averages; for example, the variation and average of income are not comparable with the variation and average of age in a sample population;
You do not want the model results to change conceptually if, for example, incomes are quoted in cents as opposed to dollars, or measurements are made in inches and feet as opposed to meters.
But plain PCA is sensitive to scale and location. For example, this is a PCA analysis on two-dimensional standard normal variables with correlation .4:
The red lines represent the directions of the loading vectors. The first principal component captures the highest variation in the joint data and correctly gives equal shares to each vector.
But things change dramatically if we move the population 2 units to the right (equivalent to increasing the average of the first vector by 2 units):
Technically we have the same data as before, but now the first principal component is basically capturing the fact that the first vector has a non-zero mean.
Similarly, if the first vector is scaled by a factor of 2:
As can be seen, the first vector now gets 4 times more weight than the second vector, simply because it has higher variance.
This shows the importance of normalizing the scale and removing the mean from the data before doing PCA.
That said, one can still come up with situations in which the relative location and scale of the explanatory factors carry useful information for the analysis and should not be wiped out of the data.
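The three cases described above (baseline, shifted by 2 units, scaled by 2) can be reproduced with a small simulation. This is only a sketch under stated assumptions: it uses simulated data with a fixed seed rather than the data behind the original plots, and it relies on the closed-form leading eigenvector of a 2x2 symmetric matrix, so it only covers the two-variable case. The function name `first_direction` is made up for illustration.

```python
import math
import random


def first_direction(xs, ys, center=False):
    """Unit vector w maximizing the 2-norm of the projection x*w[0] + y*w[1]."""
    n = len(xs)
    if center:
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        xs = [x - mean_x for x in xs]
        ys = [y - mean_y for y in ys]
    # Entries of the 2x2 second-moment matrix [[a, b], [b, c]].
    a = sum(x * x for x in xs) / n
    b = sum(x * y for x, y in zip(xs, ys)) / n
    c = sum(y * y for y in ys) / n
    # Closed-form angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * b, a - c)
    return math.cos(theta), math.sin(theta)


# Simulated standard normal pair with correlation 0.4 (seed fixed for repeatability).
random.seed(0)
rho, n = 0.4, 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [rho * xi + math.sqrt(1 - rho**2) * random.gauss(0, 1) for xi in x]

baseline = first_direction(x, y)                     # roughly equal loadings
shifted = first_direction([xi + 2 for xi in x], y)   # first variable moved 2 units
scaled = first_direction([2 * xi for xi in x], y)    # first variable doubled
recentred = first_direction([xi + 2 for xi in x], y, center=True)

for name, w in [("baseline", baseline), ("shifted", shifted),
                ("scaled", scaled), ("shifted, then centered", recentred)]:
    print(f"{name:>24}: loadings = ({w[0]:.3f}, {w[1]:.3f})")
```

Subtracting the mean undoes the effect of the shift, while the effect of the scaling goes away only if each variable is also divided by its standard deviation.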