Using rdrobust to calculate 2SLS and getting an error message stating "should be set within the range of x"

I am trying to calculate a two-stage least squares estimate in Stata. My dataset looks like the following:
income   bmi   health_index   asian   black   q_o_l   age   aide
   100    19             99       1       0      87    23      1
     0    21             87       1       0      76    29      0
  1002    23             56       0       1      12    47      1
  2200    24             67       1       0      73    43      0
  2076    21             78       1       0      12    73      1
I am trying to use rdrobust to estimate the treatment effect. I used the following code:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
I varied the income variable with multiple polynomial forms and used multiple bandwidths. I keep getting the same error message stating:
c() should be set within the range of aide
I am assuming that this has to do with the bandwidth. How can I correct it?

You have two issues with the syntax. You wrote:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
This ignores the health_index through age variables, since you can only have one running variable; additional covariates belong in the covs() option. Stata then tries to use a cutoff of 10 for aide, because the second variable is always treated as the running one. Since aide is binary, a cutoff of 10 lies outside its range, and Stata complains.
It's not obvious to me what makes sense in your setting, but here's an example demonstrating the problem and the two remedies. Note that in the second run below, where the covariates are passed as extra variables rather than through covs(), the header reads "Sharp RD estimates" rather than "Covariate-adjusted sharp RD estimates", confirming that the extra variables were silently ignored.
. use "http://fmwww.bc.edu/repec/bocode/r/rdrobust_senate.dta", clear
. rdrobust vote margin, c(0) covs(state year class termshouse termssenate population)
Covariate-adjusted sharp RD estimates using local polynomial regression.
Cutoff c = 0 | Left of c Right of c Number of obs = 1108
-------------------+---------------------- BW type = mserd
Number of obs | 491 617 Kernel = Triangular
Eff. Number of obs | 309 279 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 17.669 17.669
BW bias (b) | 28.587 28.587
rho (h/b) | 0.618 0.618
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | 6.8862 1.3971 4.9291 0.000 4.14804 9.62438
Robust | - - 4.2540 0.000 3.78697 10.258
--------------------------------------------------------------------------------
Covariate-adjusted estimates. Additional covariates included: 6
. sum margin
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
margin | 1,390 7.171159 34.32488 -100 100
. rdrobust vote margin state year class termshouse termssenate population, c(7) // margin ranges from -100 to 100
Sharp RD estimates using local polynomial regression.
Cutoff c = 7 | Left of c Right of c Number of obs = 1297
-------------------+---------------------- BW type = mserd
Number of obs | 744 553 Kernel = Triangular
Eff. Number of obs | 334 215 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 14.423 14.423
BW bias (b) | 24.252 24.252
rho (h/b) | 0.595 0.595
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | .1531 1.7487 0.0875 0.930 -3.27434 3.58053
Robust | - - -0.0718 0.943 -4.25518 3.95464
--------------------------------------------------------------------------------
. rdrobust vote margin state year class termshouse termssenate population, c(-100) // nonsensical cutoff for margin
c() should be set within the range of margin
r(125);
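If, in your setting, health_index is the running variable and aide is the treatment actually received, a fuzzy RD specification might look like the sketch below. This is only a sketch: the cutoff of 70 is purely illustrative (c() must lie inside the observed range of health_index), and whether a fuzzy design is appropriate depends on how aide is assigned around the cutoff.
rdrobust q_o_l health_index, c(70) fuzzy(aide) covs(bmi income asian black age)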

Related

Comparing means of two variables with the svy prefix in Stata (no ttest)

I am working with survey data and need to compare the means of a couple of variables. Since this is survey data, I need to apply survey weights, requiring the use of the svy prefix. This means that I cannot rely on Stata's ttest command. I essentially need to recreate the results of the following two ttest commands:
ttest bcg_vaccinated == chc_bcg_vaccinated_2, unpaired
ttest bcg_vaccinated == chc_bcg_vaccinated_2
bcg_vaccinated is a self-reported variable on BCG vaccination status while chc_bcg_vaccinated_2 is BCG vaccination status verified against a child health card. You will notice that chc_bcg_vaccinated_2 has missing values. These indicate that the child did not have a health card. So missing indicates no health card, 0 means the vaccination was not given, and finally, 1 means the vaccination was given. But this means that the variables have a different number of non-missing observations.
I have found a solution for the second ttest command by creating a variable that is the difference between the two vaccination variables:
gen test_diff = bcg_vaccinated - chc_bcg_vaccinated_2
regress test_diff
The above code runs only for the observations where both vaccination variables are non-missing, replicating the paired t-test listed above. Unfortunately, I cannot figure out how to do the first version. The first version would compare the means of both variables on the full set of observations.
Here are some example data for the two variables. Each row represents a different child.
clear
input byte bcg_vaccinated float chc_bcg_vaccinated_2
0 .
1 0
1 1
1 1
1 0
0 .
1 1
1 1
1 1
1 0
0 .
1 1
1 1
0 .
1 1
1 1
1 0
0 .
1 0
1 0
1 0
0 .
0 .
1 1
0 .
You need to get the data into a suitable form for a regression. First, here is the unpaired t-test to be replicated, followed by the regression equivalent:
. ttest bcg_vaccinated == chc_bcg_vaccinated_2, unpaired
Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
bcg_va~d | 25 .68 .095219 .4760952 .4834775 .8765225
chc_bc~2 | 17 .5882353 .1230382 .5072997 .3274059 .8490647
---------+--------------------------------------------------------------------
Combined | 42 .6428571 .0748318 .4849656 .4917312 .7939831
---------+--------------------------------------------------------------------
diff | .0917647 .1536653 -.2188044 .4023338
------------------------------------------------------------------------------
diff = mean(bcg_vaccinated) - mean(chc_bcg_vaccin~2) t = 0.5972
H0: diff = 0 Degrees of freedom = 40
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.7231 Pr(|T| > |t|) = 0.5538 Pr(T > t) = 0.2769
. display r(p)
.5537576
. quietly stack bcg_vaccinated chc_bcg_vaccinated_2, into(vax_status) clear
. quietly recode _stack (1 = 1 "SR") (2 = 0 "CHC"), gen(group) label(group)
. regress vax_status i.group
Source | SS df MS Number of obs = 42
-------------+---------------------------------- F(1, 40) = 0.36
Model | .085210084 1 .085210084 Prob > F = 0.5538
Residual | 9.55764706 40 .238941176 R-squared = 0.0088
-------------+---------------------------------- Adj R-squared = -0.0159
Total | 9.64285714 41 .235191638 Root MSE = .48882
------------------------------------------------------------------------------
vax_status | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
group |
SR | .0917647 .1536653 0.60 0.554 -.2188044 .4023338
_cons | .5882353 .1185553 4.96 0.000 .3486261 .8278445
------------------------------------------------------------------------------
. testparm 1.group
( 1) 1.group = 0
F( 1, 40) = 0.36
Prob > F = 0.5538
. display r(p)
.5537576
The testparm and display are not needed; they just show more digits.
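Because the comparison is now an ordinary regression, the survey-weighted version the question asks about is just a matter of declaring the design and adding the svy prefix. A minimal sketch, assuming a sampling weight variable named pw (hypothetical; substitute your actual design). Note that stack keeps only the stacked variables and clear replaces the dataset, so the weight must be stacked alongside each vaccination variable and svyset run afterwards:
stack bcg_vaccinated pw chc_bcg_vaccinated_2 pw, into(vax_status pw2) clear
recode _stack (1 = 1 "SR") (2 = 0 "CHC"), gen(group) label(group)
svyset [pweight=pw2]  // declare the design after stacking
svy: regress vax_status i.group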

Stata help: null hypothesis against the alternative hypothesis

How can you test a null hypothesis against an alternative hypothesis in Stata? If I have the hypothesis H_0: \beta_1 = \beta_2 = 0 against H_A: \beta_1 ≠ \beta_2 ≠ 0, what would the code be?
This can be done using testparm or test:
. sysuse auto, clear
(1978 Automobile Data)
. replace weight = weight/1000
variable weight was int now float
(74 real changes made)
. reg price mpg weight i.foreign
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(3, 70) = 23.29
Model | 317252879 3 105750960 Prob > F = 0.0000
Residual | 317812517 70 4540178.81 R-squared = 0.4996
-------------+---------------------------------- Adj R-squared = 0.4781
Total | 635065396 73 8699525.97 Root MSE = 2130.8
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg | 21.85361 74.22114 0.29 0.769 -126.1758 169.883
weight | 3464.706 630.749 5.49 0.000 2206.717 4722.695
|
foreign |
Foreign | 3673.06 683.9783 5.37 0.000 2308.909 5037.212
_cons | -5853.696 3376.987 -1.73 0.087 -12588.88 881.4934
------------------------------------------------------------------------------
. test weight=1.foreign=3500
( 1) weight - 1.foreign = 0
( 2) weight = 3500
F( 2, 70) = 0.05
Prob > F = 0.9466
The two-sided p-value is stored in r(p):
. display r(p)
.94664298
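Note that the test above checks H_0: \beta_weight = \beta_foreign = 3500. For the null stated in the question, H_0: \beta_1 = \beta_2 = 0, list the coefficients so that each is jointly tested against zero (with the same regression still in memory):
. test weight 1.foreign
testparm weight 1.foreign produces the same joint F test.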

Treatment factor variable omitted in Stata regression

I'm running a basic difference-in-differences regression model with year and county fixed effects with the following code:
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_born young_population manufacturing low_skill_sector unemployment ln_median_income [weight = mean_population], fe cluster(fips) robust
i.treated is a dichotomous measure of whether or not a county received the treatment over the lifetime of the study, and after_1980 marks the post-treatment period. However, when I run this regression, the estimate for my treatment variable is omitted, so I can't interpret the results. Below is the output. I would love some guidance on what to check so that I can get an estimate for the treated counties prior to treatment.
(analytic weights assumed)
note: 1.treated omitted because of collinearity
note: 2000.year omitted because of collinearity
Fixed-effects (within) regression Number of obs = 15,221
Group variable: fips Number of groups = 3,117
R-sq: Obs per group:
within = 0.2269 min = 1
between = 0.1093 avg = 4.9
overall = 0.0649 max = 5
F(12,3116) = 89.46
corr(u_i, Xb) = 0.0502 Prob > F = 0.0000
(Std. Err. adjusted for 3,117 clusters in fips)
---------------------------------------------------------------------------------
| Robust
ln_murder_rate | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
1.treated | 0 (omitted)
1.after_1980 | .2012816 .1105839 1.82 0.069 -.0155431 .4181063
|
treated#|
after_1980 |
1 1 | .0469658 .0857318 0.55 0.584 -.1211307 .2150622
|
year |
1970 | .4026329 .0610974 6.59 0.000 .2828376 .5224282
1980 | .6235034 .0839568 7.43 0.000 .4588872 .7881196
1990 | .4040176 .0525122 7.69 0.000 .3010555 .5069797
2000 | 0 (omitted)
|
ln_deprivation | .3500093 .119083 2.94 0.003 .1165202 .5834983
ln_foreign_born | .0179036 .0616842 0.29 0.772 -.1030421 .1388494
young_populat~n | .0030727 .0081619 0.38 0.707 -.0129306 .0190761
manufacturing | -.0242317 .0073166 -3.31 0.001 -.0385776 -.0098858
low_skill_sec~r | -.0084896 .0088702 -0.96 0.339 -.0258816 .0089025
unemployment | .0335105 .027627 1.21 0.225 -.0206585 .0876796
ln_median_inc~e | -.2423776 .1496396 -1.62 0.105 -.5357799 .0510246
_cons | 2.751071 1.53976 1.79 0.074 -.2679753 5.770118
----------------+----------------------------------------------------------------
sigma_u | .71424066
sigma_e | .62213091
rho | .56859936 (fraction of variance due to u_i)
---------------------------------------------------------------------------------
This is borderline off-topic since this is essentially a statistical question.
The variable treated is dropped because it is time-invariant and you are running a fixed-effects regression, which transforms the data by subtracting each panel's average from every covariate and the outcome. Treated counties have treated equal to one in every year, so subtracting the panel average (also one) yields zero; likewise for control counties, where treated is always zero. The transformed treated column is therefore all zeros, and Stata drops it because a column with no variation would make the design matrix non-invertible.
The parameter you care about is treated#after_1980, which is the DID effect, and it is reported in your output. The fact that treated is dropped is not a concern.
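To see the mechanics directly, here is a minimal sketch using a built-in panel dataset (nlswork, where race is time-invariant within person) showing that within-demeaning wipes out any time-invariant variable:
webuse nlswork, clear
xtset idcode year
bysort idcode: egen pmean = mean(race)  // panel average of a time-invariant variable
gen demeaned = race - pmean             // identically zero for every observation
summarize demeaned                      // mean 0, sd 0: nothing left for xtreg, fe to estimate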

How to obtain a confidence interval for the difference in two proportions in Stata

I would like to obtain a confidence interval for the difference in two proportions.
For example
webuse highschool
tab race sex, col chi2
1=white, |
2=black, | 1=male, 2=female
3=other | male female | Total
-----------+----------------------+----------
White | 1,702 1,850 | 3,552
| 87.82 86.73 | 87.25
-----------+----------------------+----------
Black | 201 249 | 450
| 10.37 11.67 | 11.05
-----------+----------------------+----------
Other | 35 34 | 69
| 1.81 1.59 | 1.69
-----------+----------------------+----------
Total | 1,938 2,133 | 4,071
| 100.00 100.00 | 100.00
Pearson chi2(2) = 1.9652 Pr = 0.374
The difference in the proportion white between males and females is 87.82 - 86.73 = 1.09 percentage points, and I would like a confidence interval for this difference.
The prtest command is what you need. Both the variable being tested and the by() variable must have exactly two groups, so build a 0/1 dummy for race first:
webuse highschool
tab race sex, col chi2
// dummy: white vs. not white (prtest needs a 0/1 variable)
gen is_white = (race == 1) if !mi(race)
// proportions test: proportion white among males vs. females
prtest is_white, by(sex)
You can also use the immediate form of prtest, namely prtesti. The downside is that you have to enter the counts and proportions manually. With your example:
prtesti 1702 0.8782 1850 0.8673
Two-sample test of proportions x: Number of obs = 1702
y: Number of obs = 1850
------------------------------------------------------------------------------
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .8782 .0079276 .8626622 .8937378
y | .8673 .0078874 .851841 .882759
-------------+----------------------------------------------------------------
diff | .0109 .0111829 -.0110181 .0328181
| under Ho: .0112015 0.97 0.331
------------------------------------------------------------------------------
diff = prop(x) - prop(y) z = 0.9731
Ho: diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(Z < z) = 0.8347 Pr(|Z| < |z|) = 0.3305 Pr(Z > z) = 0.1653

Once I've used the regression function in Excel, how do I find out the formula it used (y = mx + b)?

I have a sample data range with four columns:
foo | bar | bizz| buzz
---------------------------
163 345 456 2435
232 234 457 2435
123 346 234 3673
Foo is the dependent variable; bar, bizz, and buzz are independent variables. I went to Data Analysis => Regression, picked those columns as appropriate, and got all of the regression statistics and some plots. How do I find the formula it used so that I can use it in my predictions in an application?
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.462484844
R Square            0.213892231
Adjusted R Square   0.212161986
Standard Error      2991.441979
Observations        1367

ANOVA
             df    SS             MS             F             Significance F
Regression    3    3318714896     1106238299     123.6196536   8.06738E-71
Residual   1363    12197112332    8948725.116
Total      1366    15515827228

            Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Intercept   703.0478619    126.1475776      5.5732173     3.01028E-08   455.5834102   950.5123135
Bar         41.53512531    2.493716675      16.65591193   7.6937E-57    36.64318651   46.42706411
Bizz        1.96479128     0.361015402      5.442402932   6.22595E-08   1.256585224   2.672997336
Buzz        16.77200247    5.419776635      3.094592933   0.002010941   6.139994479   27.40401046

RESIDUAL OUTPUT                                                  PROBABILITY OUTPUT
Observation   Predicted foo   Residuals     Standard Residuals   Percentile    foo
1             6780.632281     34894.36772   11.67756172          0.036576445   63
2             6722.069851     28513.93015   9.542318743          0.109729334   63
3             3382.925842     21471.07416   7.185394378          0.182882224   63
Oh hey, my stats class looks 98% less useless now.
According to that output,
foo = 703.0478619 + 41.53512531 * bar + 1.96479128 * bizz + 16.77200247 * buzz
You can read these values from the Coefficients column in the rows for Intercept, Bar, Bizz, and Buzz.
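To compute a prediction inside the sheet itself, plug the coefficients into a cell formula. A sketch, assuming the new observation's bar, bizz, and buzz values sit in cells B2, C2, and D2 (adjust the references to your layout):
=703.0478619 + 41.53512531*B2 + 1.96479128*C2 + 16.77200247*D2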
Should probably note that the R Square value is quite low (about 0.21), which means the independent variables explain only about a fifth of the variance in foo.
