Comparing means of two variables with the svy prefix in Stata (no ttest) - statistics

I am working with survey data and need to compare the means of a couple of variables. Since this is survey data, I need to apply survey weights, requiring the use of the svy prefix. This means that I cannot rely on Stata's ttest command. I essentially need to recreate the results of the following two ttest commands:
ttest bcg_vaccinated == chc_bcg_vaccinated_2, unpaired
ttest bcg_vaccinated == chc_bcg_vaccinated_2
bcg_vaccinated is a self-reported variable on BCG vaccination status while chc_bcg_vaccinated_2 is BCG vaccination status verified against a child health card. You will notice that chc_bcg_vaccinated_2 has missing values. These indicate that the child did not have a health card. So missing indicates no health card, 0 means the vaccination was not given, and finally, 1 means the vaccination was given. But this means that the variables have a different number of non-missing observations.
I have found the solution to the second ttest command, by creating a variable which is a difference between the two vaccination variables:
gen test_diff = bcg_vaccinated - chc_bcg_vaccinated_2
regress test_diff
The above code runs only for the observations where both vaccination variables are non-missing, replicating the paired t-test listed above. Unfortunately, I cannot figure out how to do the first version. The first version would compare the means of both variables on the full set of observations.
Here are some example data for the two variables. Each row represents a different child.
clear
input byte bcg_vaccinated float chc_bcg_vaccinated_2
0 .
1 0
1 1
1 1
1 0
0 .
1 1
1 1
1 1
1 0
0 .
1 1
1 1
0 .
1 1
1 1
1 0
0 .
1 0
1 0
1 0
0 .
0 .
1 1
0 .
end

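You need to get the data into a suitable form for a regression. The session below first runs the unpaired ttest for reference, then stacks the two variables into one and regresses it on a group indicator; the difference, standard error, and p-value match: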
. ttest bcg_vaccinated == chc_bcg_vaccinated_2, unpaired
Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
bcg_va~d | 25 .68 .095219 .4760952 .4834775 .8765225
chc_bc~2 | 17 .5882353 .1230382 .5072997 .3274059 .8490647
---------+--------------------------------------------------------------------
Combined | 42 .6428571 .0748318 .4849656 .4917312 .7939831
---------+--------------------------------------------------------------------
diff | .0917647 .1536653 -.2188044 .4023338
------------------------------------------------------------------------------
diff = mean(bcg_vaccinated) - mean(chc_bcg_vaccin~2) t = 0.5972
H0: diff = 0 Degrees of freedom = 40
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.7231 Pr(|T| > |t|) = 0.5538 Pr(T > t) = 0.2769
. display r(p)
.5537576
. quietly stack bcg_vaccinated chc_bcg_vaccinated_2, into(vax_status) clear
. quietly recode _stack (1 = 1 "SR") (2 = 0 "CHC"), gen(group) label(group)
. regress vax_status i.group
Source | SS df MS Number of obs = 42
-------------+---------------------------------- F(1, 40) = 0.36
Model | .085210084 1 .085210084 Prob > F = 0.5538
Residual | 9.55764706 40 .238941176 R-squared = 0.0088
-------------+---------------------------------- Adj R-squared = -0.0159
Total | 9.64285714 41 .235191638 Root MSE = .48882
------------------------------------------------------------------------------
vax_status | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
group |
SR | .0917647 .1536653 0.60 0.554 -.2188044 .4023338
_cons | .5882353 .1185553 4.96 0.000 .3486261 .8278445
------------------------------------------------------------------------------
. testparm 1.group
( 1) 1.group = 0
F( 1, 40) = 0.36
Prob > F = 0.5538
. display r(p)
.5537576
The testparm and display are not needed; they just show more digits.
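The regression above is unweighted. To bring the survey design into the same comparison, one option is to reshape to long form so that the design variables travel with each row, and then use svy: regress. The following is a minimal sketch rather than a tested recipe for your data: pw, psu, and child_id are hypothetical names, the placeholder values (equal weights, one child per PSU) are there only to make the example run, and the six rows are the first six rows of the example data.

```stata
clear
input byte bcg_vaccinated float chc_bcg_vaccinated_2
0 .
1 0
1 1
1 1
1 0
0 .
end

* Placeholder design variables (assumptions; replace with the real ones)
gen double pw = 1
gen long psu = _n
gen long child_id = _n

* Long form: one row per child per measure; sr = 1 self-report, 0 health card
rename (bcg_vaccinated chc_bcg_vaccinated_2) (vax1 vax0)
reshape long vax, i(child_id) j(sr)

svyset psu [pweight = pw]
svy: regress vax i.sr        // coefficient on 1.sr is the difference in means
```

Because both rows of a child fall in the same PSU, the linearized standard errors allow for the same child contributing to both groups. The paired comparison carries over more directly: svy: regress test_diff on the original wide data.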

Related

Using rdrobust to calculate 2sls and getting error message stating "should be set within the range of x"

I am trying to calculate a two stage least squares in Stata. My dataset looks like the following:
income bmi health_index asian black q_o_l age aide
100 19 99 1 0 87 23 1
0 21 87 1 0 76 29 0
1002 23 56 0 1 12 47 1
2200 24 67 1 0 73 43 0
2076 21 78 1 0 12 73 1
I am trying to use rdrobust to estimate the treatment effect. I used the following code:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
I varied the income variable with multiple polynomial forms and used multiple bandwidths. I keep getting the same error message stating:
c() should be set within the range of aide
I am assuming that this has to do with the bandwidth. How can I correct it?
You have two issues with the syntax. You wrote:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
This ignores the variables health_index through age, since rdrobust accepts only one running variable; covariates belong in the covs() option. It then tries to use a cutoff of 10 for aide (the second variable is always the running one). Since aide is binary, 10 lies outside its range and Stata complains.
It's not obvious to me what makes sense in your setting, but here's an example demonstrating the problem and the two remedies:
. use "http://fmwww.bc.edu/repec/bocode/r/rdrobust_senate.dta", clear
. rdrobust vote margin, c(0) covs(state year class termshouse termssenate population)
Covariate-adjusted sharp RD estimates using local polynomial regression.
Cutoff c = 0 | Left of c Right of c Number of obs = 1108
-------------------+---------------------- BW type = mserd
Number of obs | 491 617 Kernel = Triangular
Eff. Number of obs | 309 279 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 17.669 17.669
BW bias (b) | 28.587 28.587
rho (h/b) | 0.618 0.618
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | 6.8862 1.3971 4.9291 0.000 4.14804 9.62438
Robust | - - 4.2540 0.000 3.78697 10.258
--------------------------------------------------------------------------------
Covariate-adjusted estimates. Additional covariates included: 6
. sum margin
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
margin | 1,390 7.171159 34.32488 -100 100
. rdrobust vote margin state year class termshouse termssenate population, c(7) // margin rang
> es from -100 to 100
Sharp RD estimates using local polynomial regression.
Cutoff c = 7 | Left of c Right of c Number of obs = 1297
-------------------+---------------------- BW type = mserd
Number of obs | 744 553 Kernel = Triangular
Eff. Number of obs | 334 215 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 14.423 14.423
BW bias (b) | 24.252 24.252
rho (h/b) | 0.595 0.595
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | .1531 1.7487 0.0875 0.930 -3.27434 3.58053
Robust | - - -0.0718 0.943 -4.25518 3.95464
--------------------------------------------------------------------------------
. rdrobust vote margin state year class termshouse termssenate population, c(-100) // nonsensical
> cutoff for margin
c() should be set within the range of margin
r(125);
end of do-file
r(125);
You might also find this answer interesting.
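Applied to the variables in the question, the corrected call might look like the sketch below. It rests on assumptions that only you can check: that income is the running variable, that eligibility changes at a hypothetical cutoff of 1000, and that aide is the treatment actually received, which would make this a fuzzy design (the RD analogue of two-stage least squares). It needs the question's dataset in memory and the rdrobust package installed.

```stata
* Assumed: running variable = income, cutoff = 1000 (hypothetical value)
summarize income                      // check the cutoff lies inside the range

* Sharp design: treatment fully determined by income crossing the cutoff
rdrobust q_o_l income, c(1000) covs(health_index bmi asian black age)

* Fuzzy design: aide is treatment received, instrumented by crossing the cutoff
rdrobust q_o_l income, c(1000) fuzzy(aide) covs(health_index bmi asian black age)
```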

How to split data and assign it into designated variables?

I have data in Stata on respondents' feelings about the current situation. There are seven types of feeling. The data are stored in the following format (note that the variable is a string, and one person can give more than one answer):
feeling
4,7
1,3,4
2,5,6,7
1,2,3,4,5,6,7
Since the data is a string, I tried to separate it by
split feeling, parse (,)
and I got the result
feeling1  feeling2  feeling3  feeling4  feeling5  feeling6  feeling7
4         7
1         3         4
2         5         6         7
1         2         3         4         5         6         7
However, this is not the result I want. Each number should go into the variable with the matching suffix. For instance:
feeling1  feeling2  feeling3  feeling4  feeling5  feeling6  feeling7
                              4                             7
1                   3         4
          2                             5         6         7
1         2         3         4         5         6         7
I am not sure whether there is a built-in command or function for this kind of problem. I am thinking of using forval to loop over every value in each variable and move it into the correct variable.
A loop over the distinct values would be enough here. I give your example in a form explained in the Stata tag wiki as more helpful and then give code to get the variables you want as numeric variables.
* Example generated by -dataex-. For more info, type help dataex
clear
input str13 feeling
"4,7"
"1,3,4"
"2,5,6,7"
"1,2,3,4,5,6,7"
end
forval j = 1/7 {
gen wanted`j' = `j' if strpos(feeling, "`j'")
gen better`j' = strpos(feeling, "`j'") > 0
}
l feeling wanted1-better3
+---------------------------------------------------------------------------+
| feeling wanted1 better1 wanted2 better2 wanted3 better3 |
|---------------------------------------------------------------------------|
1. | 4,7 . 0 . 0 . 0 |
2. | 1,3,4 1 1 . 0 3 1 |
3. | 2,5,6,7 . 0 2 1 . 0 |
4. | 1,2,3,4,5,6,7 1 1 2 1 3 1 |
+---------------------------------------------------------------------------+
If you wanted a string result, that would be yielded by
gen wanted`j' = "`j'" if strpos(feeling, "`j'")
Had the number of feelings been 10 or more, you would have needed more careful code, because a search for "1", for example, would also find it within "10".
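One way to make the search safe for 10 or more categories is to wrap both the string and the search term in the delimiter, so that only whole entries match. A minimal sketch, assuming the values are separated by commas with no spaces; the three input rows and the has* variable names are made up for illustration:

```stata
clear
input str13 feeling
"1,10"
"10,12"
"2,11"
end

forvalues j = 1/12 {
    * ",1," matches inside ",1,10," but not inside ",10,12,"
    gen byte has`j' = strpos("," + feeling + ",", ",`j',") > 0
}

list feeling has1 has2 has10 has11 has12
```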
Indicator (some say dummy) variables with distinct values 1 or 0 are immensely more valuable for most analysis of this kind of data.
Note Stata-related sources such as
this FAQ
this paper
and this paper.

Stata help: null hypothesis against the alternative hypothesis

How can you test a null hypothesis against an alternative hypothesis in Stata? Suppose I have the hypothesis H_0:\beta_1=\beta_2=0 against H_A:\beta_1 ≠ \beta_2 ≠ 0. What would the code be?
This can be done using testparm or test:
. sysuse auto, clear
(1978 Automobile Data)
. replace weight = weight/1000
variable weight was int now float
(74 real changes made)
. reg price mpg weight i.foreign
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(3, 70) = 23.29
Model | 317252879 3 105750960 Prob > F = 0.0000
Residual | 317812517 70 4540178.81 R-squared = 0.4996
-------------+---------------------------------- Adj R-squared = 0.4781
Total | 635065396 73 8699525.97 Root MSE = 2130.8
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg | 21.85361 74.22114 0.29 0.769 -126.1758 169.883
weight | 3464.706 630.749 5.49 0.000 2206.717 4722.695
|
foreign |
Foreign | 3673.06 683.9783 5.37 0.000 2308.909 5037.212
_cons | -5853.696 3376.987 -1.73 0.087 -12588.88 881.4934
------------------------------------------------------------------------------
. test weight=1.foreign=3500
( 1) weight - 1.foreign = 0
( 2) weight = 3500
F( 2, 70) = 0.05
Prob > F = 0.9466
The p-value of the joint F test is stored in r(p):
. display r(p)
.94664298
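For the hypothesis as stated in the question, that two coefficients are jointly zero, the same commands apply with the coefficient names listed and no value given. A minimal sketch on the same auto data, taking mpg and weight as the two coefficients of interest (an arbitrary choice for illustration):

```stata
sysuse auto, clear
regress price mpg weight i.foreign

* H0: _b[mpg] = 0 and _b[weight] = 0 (joint F test with 2 numerator df)
test mpg weight

* testparm runs the same joint test and accepts varlists and factor terms
testparm mpg weight

display r(F)
display r(p)
```

The alternative here is that at least one of the two coefficients differs from zero.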

Calculate average of 1kb windows

My file looks like the following:
18 1600014 + CAA 0 3
18 1600017 - CTT 0 1
18 1600019 - CTC 0 1
18 1600020 + CAT 0 3
18 1600031 - CAA 0 1
18 1600035 - CAT 0 1
...
I am trying to calculate the average of column 6 in windows that each cover a range of 1000 in column 2, so 1600001-1601000, 1601001-1602000, etc. My values go from 1600000 to 1700000. Is there any way to do this in one step? My initial thought was to use grep to sort these values, but that would require many different commands. I am aware that you can calculate an average with awk, but can you iterate over each window?
The desired output would be something like this:
1600001-1601000 3.215
1601001-1602000 3.141
1602001-1603000 3.542
You can use GNU awk to gather the counts and sums. If I understand your problem correctly, you might need something like this:
BEGIN {
    mod = 1000
    # GNU awk only: traverse array indices in ascending numeric order
    PROCINFO["sorted_in"] = "@ind_num_asc"
}
{
    # window index: positions 1600001-1601000 all map to k = 1600
    k = int(($2 - 1) / mod)
    sum[k] += $6
    cnt[k]++
}
END {
    for (k in sum)
        printf("%d-%d\t%6.3f\n", k * mod + 1, (k + 1) * mod, sum[k] / cnt[k])
}

How to obtain a confidence interval for the difference in two proportions in Stata

I would like to obtain a confidence interval for the difference in two proportions.
For example
webuse highschool
tab race sex, col chi2
1=white, |
2=black, | 1=male, 2=female
3=other | male female | Total
-----------+----------------------+----------
White | 1,702 1,850 | 3,552
| 87.82 86.73 | 87.25
-----------+----------------------+----------
Black | 201 249 | 450
| 10.37 11.67 | 11.05
-----------+----------------------+----------
Other | 35 34 | 69
| 1.81 1.59 | 1.69
-----------+----------------------+----------
Total | 1,938 2,133 | 4,071
| 100.00 100.00 | 100.00
Pearson chi2(2) = 1.9652 Pr = 0.374
The difference between the proportions of males and of females who are white is 87.82 - 86.73 = 1.09 percentage points, and I would like a confidence interval for this difference.
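The prtest command is what you need, in the form prtest varname, by(groupvar). The tested variable has to be coded 0/1 and the grouping variable must not contain more than two groups, so create indicator variables first: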
webuse highschool
tab race sex, col chi2
// dummies
gen is_black = (race == 2) if race < 3
gen is_female = (sex == 2) if !mi(sex)
// proportions test
prtest is_female, by(is_black)
Alternatively, you can use the immediate form of prtest, that is, prtesti. The downside is that you have to enter the numbers of observations and the proportions by hand. With your example:
prtesti 1702 0.8782 1850 0.8673
Two-sample test of proportions x: Number of obs = 1702
y: Number of obs = 1850
------------------------------------------------------------------------------
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .8782 .0079276 .8626622 .8937378
y | .8673 .0078874 .851841 .882759
-------------+----------------------------------------------------------------
diff | .0109 .0111829 -.0110181 .0328181
| under Ho: .0112015 0.97 0.331
------------------------------------------------------------------------------
diff = prop(x) - prop(y) z = 0.9731
Ho: diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(Z < z) = 0.8347 Pr(|Z| < |z|) = 0.3305 Pr(Z > z) = 0.1653
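One caveat about the call above: prtesti takes the number of observations in each group as its first and third arguments, whereas 1702 and 1850 are the counts of white males and white females. If the aim is the difference between the proportion of males and the proportion of females who are white, the column totals from the table (1,938 and 2,133) would be the group sizes. A sketch under that reading, with output omitted:

```stata
* Group sizes (column totals) with the proportions from the table
prtesti 1938 0.8782 2133 0.8673

* Same test from raw counts: count treats the 2nd and 4th arguments as
* numbers of successes (white males, white females)
prtesti 1938 1702 2133 1850, count
```

The confidence interval for diff in the resulting output is then the interval for the 1.09 percentage point difference.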
