Stata help: null hypothesis against the alternative hypothesis - statistics

How can you test a joint null hypothesis against its alternative in Stata? If I have the hypothesis H_0: \beta_1 = \beta_2 = 0 against H_A: at least one of \beta_1, \beta_2 is nonzero, what would the code be?

This can be done using testparm or test:
. sysuse auto, clear
(1978 Automobile Data)
. replace weight = weight/1000
variable weight was int now float
(74 real changes made)
. reg price mpg weight i.foreign
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(3, 70) = 23.29
Model | 317252879 3 105750960 Prob > F = 0.0000
Residual | 317812517 70 4540178.81 R-squared = 0.4996
-------------+---------------------------------- Adj R-squared = 0.4781
Total | 635065396 73 8699525.97 Root MSE = 2130.8
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg | 21.85361 74.22114 0.29 0.769 -126.1758 169.883
weight | 3464.706 630.749 5.49 0.000 2206.717 4722.695
|
foreign |
Foreign | 3673.06 683.9783 5.37 0.000 2308.909 5037.212
_cons | -5853.696 3376.987 -1.73 0.087 -12588.88 881.4934
------------------------------------------------------------------------------
. test weight=1.foreign=3500
( 1) weight - 1.foreign = 0
( 2) weight = 3500
F( 2, 70) = 0.05
Prob > F = 0.9466
The p-value of the joint test is stored in r(p):
. display r(p)
.94664298
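For the joint null in the question itself, where both slope coefficients are tested against zero, either command works after the regression. A minimal sketch (output omitted), using the two slopes from the model above:
// jointly test H0: _b[mpg] = _b[weight] = 0
test mpg weight
// equivalent, via testparm
testparm mpg weight
Both report the same F statistic, and the p-value is again left behind in r(p).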

Related

Comparing means of two variables with the svy prefix in Stata (no ttest)

I am working with survey data and need to compare the means of a couple of variables. Since this is survey data, I need to apply survey weights, requiring the use of the svy prefix. This means that I cannot rely on Stata's ttest command. I essentially need to recreate the results of the following two ttest commands:
ttest bcg_vaccinated == chc_bcg_vaccinated_2, unpaired
ttest bcg_vaccinated == chc_bcg_vaccinated_2
bcg_vaccinated is a self-reported variable on BCG vaccination status while chc_bcg_vaccinated_2 is BCG vaccination status verified against a child health card. You will notice that chc_bcg_vaccinated_2 has missing values. These indicate that the child did not have a health card. So missing indicates no health card, 0 means the vaccination was not given, and finally, 1 means the vaccination was given. But this means that the variables have a different number of non-missing observations.
I have found the solution to the second ttest command by creating a variable that is the difference between the two vaccination variables:
gen test_diff = bcg_vaccinated - chc_bcg_vaccinated_2
regress test_diff
The above code runs only for the observations where both vaccination variables are non-missing, replicating the paired t-test listed above. Unfortunately, I cannot figure out how to do the first version. The first version would compare the means of both variables on the full set of observations.
Here are some example data for the two variables. Each row represents a different child.
clear
input byte bcg_vaccinated float chc_bcg_vaccinated_2
0 .
1 0
1 1
1 1
1 0
0 .
1 1
1 1
1 1
1 0
0 .
1 1
1 1
0 .
1 1
1 1
1 0
0 .
1 0
1 0
1 0
0 .
0 .
1 1
0 .
You need to get the data into a suitable form for a regression:
. ttest bcg_vaccinated == chc_bcg_vaccinated_2, unpaired
Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
bcg_va~d | 25 .68 .095219 .4760952 .4834775 .8765225
chc_bc~2 | 17 .5882353 .1230382 .5072997 .3274059 .8490647
---------+--------------------------------------------------------------------
Combined | 42 .6428571 .0748318 .4849656 .4917312 .7939831
---------+--------------------------------------------------------------------
diff | .0917647 .1536653 -.2188044 .4023338
------------------------------------------------------------------------------
diff = mean(bcg_vaccinated) - mean(chc_bcg_vaccin~2) t = 0.5972
H0: diff = 0 Degrees of freedom = 40
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.7231 Pr(|T| > |t|) = 0.5538 Pr(T > t) = 0.2769
. display r(p)
.5537576
. quietly stack bcg_vaccinated chc_bcg_vaccinated_2, into(vax_status) clear
. quietly recode _stack (1 = 1 "SR") (2 = 0 "CHC"), gen(group) label(group)
. regress vax_status i.group
Source | SS df MS Number of obs = 42
-------------+---------------------------------- F(1, 40) = 0.36
Model | .085210084 1 .085210084 Prob > F = 0.5538
Residual | 9.55764706 40 .238941176 R-squared = 0.0088
-------------+---------------------------------- Adj R-squared = -0.0159
Total | 9.64285714 41 .235191638 Root MSE = .48882
------------------------------------------------------------------------------
vax_status | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
group |
SR | .0917647 .1536653 0.60 0.554 -.2188044 .4023338
_cons | .5882353 .1185553 4.96 0.000 .3486261 .8278445
------------------------------------------------------------------------------
. testparm 1.group
( 1) 1.group = 0
F( 1, 40) = 0.36
Prob > F = 0.5538
. display r(p)
.5537576
The testparm and display commands are not needed; they just show the p-value to more digits.
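With the actual survey data, the same idea should carry over under the svy prefix, but the design information has to survive the reshaping. A rough sketch, assuming the data are already svyset and that pweight_var is a stand-in name for the sampling-weight variable:
// paired comparison: the svy prefix works directly with regress on the difference
gen test_diff = bcg_vaccinated - chc_bcg_vaccinated_2
svy: regress test_diff
// unpaired comparison: stack the weight alongside each outcome,
// then declare the survey design again on the stacked data
stack bcg_vaccinated pweight_var chc_bcg_vaccinated_2 pweight_var, into(vax_status pw) clear
recode _stack (1 = 1 "SR") (2 = 0 "CHC"), gen(group) label(group)
svyset [pweight = pw]
svy: regress vax_status i.group
If the design also involves strata or PSU variables, they would need to be stacked and re-declared in the same way.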

IndexError: index 20 is out of bounds for axis 0 with size 20

I am working with radio frequency interference (RFI). I am trying to simulate the RFIs and add them to the frequency range according to their classifications, i.e. [continuous, intermittent, malfunction]. The continuous RFIs have to be present everywhere. I did the calculations and got the values, but I keep getting an error, so I can't plot the values I get.
This is what I have done:
import numpy as np

def calculate_RFI(classifications, descriptions, amplitude, freq_samples=64, min_freq=1, max_freq=2, min_HA=-1.5, max_HA=1.5, sampling_H=60*3*2):
    '''
    This function calculates the RFI array

    Parameters
    ----------
    classifications: list
        The list of classifications
    descriptions: list
        The list of RFI descriptions
    amplitude: list
        The list of RFI amplitudes
    freq_samples: int
        The number of frequency samples
    min_freq: float
        The minimum frequency in GHz
    max_freq: float
        The maximum frequency in GHz
    min_HA: float
        The minimum hour angle in hours
    max_HA: float
        The maximum hour angle in hours
    sampling_H: float
        The sampling interval of the hour angle in hours
    '''
    # create a multidimensional meshgrid of the frequency and hour angle
    meshgrid = np.mgrid[min_freq:max_freq:freq_samples*1j,
                        min_HA:max_HA:sampling_H*1j]
    freq = meshgrid[0]
    HA = meshgrid[1]
    # create frequency and hour angle point size
    freq_point_size = (max_freq - min_freq) / freq_samples
    HA_point_size = (max_HA - min_HA) / sampling_H
    # create an empty RFI array which takes the shape of the freq meshgrid
    RFI = np.zeros(freq.shape)
    point_size_label = np.zeros((freq.shape[0], freq.shape[1], 64))
    # print(freq.shape)
    classification = classifications
    description = descriptions
    amp = amplitude
    # loop through the classifications
    for i in range(len(classification)):
        # print(classification[i])
        # loop through the amplitudes
        for j in range(len(amp)):
            # check if the classification is continuous
            if classification[i] == 'continuous':
                # get an array of the amplitude values with the same shape as the frequency array
                amp_array = amp[j] * np.ones(freq.shape)
                # get a 2D array of the start and end frequencies of the current classification
                # (rfi_data is the pandas DataFrame shown below; convert from MHz to GHz)
                freq_range = rfi_data[rfi_data['Classification'] == classification[i]][['start_freq', 'end_freq']].values * 10**-3
                # add the RFI of the continuous classification by checking if the frequency is in the range of the current classification
                RFI += np.where(np.logical_and(freq >= freq_range[j][0], freq <= freq_range[j][1], RFI), amp_array, RFI)
    return RFI
To give a glimpse of the data, this is what the data look like:
| Frequency | Classification | start_freq | end_freq | amplitude |
| --------- | -------------- | ---------- | -------- | --------- |
| 1000 | intermittent | 1000 | 1000 | 0.299792 |
| 1030 | intermittent | 1030 | 1030 | 0.291061 |
| 1025-1150 | intermittent | 1025 | 1150 | 0.260689 |
| 1090 | intermittent | 1090.0 | 1090.0 | 0.275039 |
| 1166-1186 | continuous | 1166 | 1186 | 0.252776 |
What confuses me is that the data are there if I try to print them, but I keep getting an error when I call the function, which reads:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/var/folders/r8/hbbztfpn4ns1pgl0687hmwp40000gp/T/ipykernel_66490/1561735211.py in <module>
----> 1 calculate_RFI(classifications, descriptions, amplitude)
/var/folders/r8/hbbztfpn4ns1pgl0687hmwp40000gp/T/ipykernel_66490/1341244216.py in calculate_RFI(classifications, descriptions, amplitude, freq_samples, min_freq, max_freq, min_HA, max_HA, sampling_H)
59 freq_range = rfi_data[rfi_data['Classification'] == classification[i]][['start_freq','end_freq']].values *10**-3 # convert to GHz otherwise it will be THz
60 # get the RFI of the continuous classification by checking if the frequency is in the range of the current classification
---> 61 RFI += np.where(np.logical_and(freq >= freq_range[j][0], freq <= freq_range[j][1], RFI), amp_array, RFI)
62
63
IndexError: index 20 is out of bounds for axis 0 with size 20

Using rdrobust to calculate 2sls and getting error message stating "should be set within the range of x"

I am trying to calculate a two stage least squares in Stata. My dataset looks like the following:
income   bmi   health_index   asian   black   q_o_l   age   aide
   100    19             99       1       0      87    23      1
     0    21             87       1       0      76    29      0
  1002    23             56       0       1      12    47      1
  2200    24             67       1       0      73    43      0
  2076    21             78       1       0      12    73      1
I am trying to use rdrobust to estimate the treatment effect. I used the following code:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
I varied the income variable with multiple polynomial forms and used multiple bandwidths. I keep getting the same error message stating:
c() should be set within the range of aide
I am assuming that this has to do with the bandwidth. How can I correct it?
You have two issues with the syntax. You wrote:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
This will ignore the variables health_index through age, since you can only have one running variable. It will then try to use a cutoff of 10 for aide (the second variable listed is always the running variable). Since aide is binary, Stata complains that the cutoff lies outside its range.
It's not obvious to me what makes sense in your setting, but here's an example demonstrating the problem and the two remedies:
. use "http://fmwww.bc.edu/repec/bocode/r/rdrobust_senate.dta", clear
. rdrobust vote margin, c(0) covs(state year class termshouse termssenate population)
Covariate-adjusted sharp RD estimates using local polynomial regression.
Cutoff c = 0 | Left of c Right of c Number of obs = 1108
-------------------+---------------------- BW type = mserd
Number of obs | 491 617 Kernel = Triangular
Eff. Number of obs | 309 279 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 17.669 17.669
BW bias (b) | 28.587 28.587
rho (h/b) | 0.618 0.618
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | 6.8862 1.3971 4.9291 0.000 4.14804 9.62438
Robust | - - 4.2540 0.000 3.78697 10.258
--------------------------------------------------------------------------------
Covariate-adjusted estimates. Additional covariates included: 6
. sum margin
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
margin | 1,390 7.171159 34.32488 -100 100
. rdrobust vote margin state year class termshouse termssenate population, c(7) // margin rang
> es from -100 to 100
Sharp RD estimates using local polynomial regression.
Cutoff c = 7 | Left of c Right of c Number of obs = 1297
-------------------+---------------------- BW type = mserd
Number of obs | 744 553 Kernel = Triangular
Eff. Number of obs | 334 215 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 14.423 14.423
BW bias (b) | 24.252 24.252
rho (h/b) | 0.595 0.595
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | .1531 1.7487 0.0875 0.930 -3.27434 3.58053
Robust | - - -0.0718 0.943 -4.25518 3.95464
--------------------------------------------------------------------------------
. rdrobust vote margin state year class termshouse termssenate population, c(-100) // nonsensical
> cutoff for margin
c() should be set within the range of margin
r(125);
end of do-file
r(125);
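If, say, health_index is the intended running variable with a cutoff at 10 and aide is the treatment whose effect you want from a fuzzy (2SLS-style) design, the command might look something like the sketch below, with the remaining variables passed through covs(). This is only a guess at the intended design:
rdrobust q_o_l health_index, c(10) fuzzy(aide) covs(bmi income asian black age)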
You might also find this answer interesting.

Treatment factor variable omitted in Stata regression

I'm running a basic difference-in-differences regression model with year and county fixed effects with the following code:
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_born young_population manufacturing low_skill_sector unemployment ln_median_income [weight = mean_population], fe cluster(fips) robust
i.treated is a dichotomous measure of whether or not a county received the treatment over the lifetime of the study, and after_1980 measures the post period of the treatment. However, when I run this regression, the estimate for my treatment variable is omitted, so I can't really interpret the results. The output is below. I would love some guidance on what to check so that I can get an estimate for the treated variable prior to treatment.
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_bo
> rn young_population manufacturing low_skill_sector unemployment ln_median_income
> [weight = mean_population], fe cluster(fips) robust
(analytic weights assumed)
note: 1.treated omitted because of collinearity
note: 2000.year omitted because of collinearity
Fixed-effects (within) regression Number of obs = 15,221
Group variable: fips Number of groups = 3,117
R-sq: Obs per group:
within = 0.2269 min = 1
between = 0.1093 avg = 4.9
overall = 0.0649 max = 5
F(12,3116) = 89.46
corr(u_i, Xb) = 0.0502 Prob > F = 0.0000
(Std. Err. adjusted for 3,117 clusters in fips)
---------------------------------------------------------------------------------
| Robust
ln_murder_rate | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
1.treated | 0 (omitted)
1.after_1980 | .2012816 .1105839 1.82 0.069 -.0155431 .4181063
|
treated#|
after_1980 |
1 1 | .0469658 .0857318 0.55 0.584 -.1211307 .2150622
|
year |
1970 | .4026329 .0610974 6.59 0.000 .2828376 .5224282
1980 | .6235034 .0839568 7.43 0.000 .4588872 .7881196
1990 | .4040176 .0525122 7.69 0.000 .3010555 .5069797
2000 | 0 (omitted)
|
ln_deprivation | .3500093 .119083 2.94 0.003 .1165202 .5834983
ln_foreign_born | .0179036 .0616842 0.29 0.772 -.1030421 .1388494
young_populat~n | .0030727 .0081619 0.38 0.707 -.0129306 .0190761
manufacturing | -.0242317 .0073166 -3.31 0.001 -.0385776 -.0098858
low_skill_sec~r | -.0084896 .0088702 -0.96 0.339 -.0258816 .0089025
unemployment | .0335105 .027627 1.21 0.225 -.0206585 .0876796
ln_median_inc~e | -.2423776 .1496396 -1.62 0.105 -.5357799 .0510246
_cons | 2.751071 1.53976 1.79 0.074 -.2679753 5.770118
----------------+----------------------------------------------------------------
sigma_u | .71424066
sigma_e | .62213091
rho | .56859936 (fraction of variance due to u_i)
---------------------------------------------------------------------------------
This is borderline off-topic since this is essentially a statistical question.
The variable treated is dropped because it is time-invariant and you are doing a fixed-effects regression, which transforms the data by subtracting each panel's average from every covariate and the outcome. Treated observations all have treated set to one, so when you subtract the panel average of treated, which is also one, you get zero. The same happens for control observations, except they all have treated set to zero. The result is that the treated column is all zeros, and Stata drops it: with no variation left, the design matrix would not be invertible.
The parameter you care about is treated#after_1980, which is the DID effect and is reported in your output. The fact that treated is dropped is not concerning.
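You can see the same behaviour with any time-invariant regressor in a built-in panel dataset; a minimal sketch using nlswork, where race is fixed within each woman:
webuse nlswork, clear
xtset idcode year
// race never varies within idcode, so the within transformation turns it into
// a column of zeros and xtreg reports it as omitted, just like treated above
xtreg ln_wage i.race age, fe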

How to obtain a confidence interval for the difference in two proportions in Stata

I would like to obtain a confidence interval for the difference in two proportions.
For example
webuse highschool
tab race sex, col chi2
1=white, |
2=black, | 1=male, 2=female
3=other | male female | Total
-----------+----------------------+----------
White | 1,702 1,850 | 3,552
| 87.82 86.73 | 87.25
-----------+----------------------+----------
Black | 201 249 | 450
| 10.37 11.67 | 11.05
-----------+----------------------+----------
Other | 35 34 | 69
| 1.81 1.59 | 1.69
-----------+----------------------+----------
Total | 1,938 2,133 | 4,071
| 100.00 100.00 | 100.00
Pearson chi2(2) = 1.9652 Pr = 0.374
The difference between the proportion white among males and the proportion white among females is 87.82 - 86.73 = 1.09 percentage points, and I would like a confidence interval for this difference.
The prtest command is what you need, e.g. prtest sex, by(race). Neither variable may contain more than two groups, so create dummies first:
webuse highschool
tab race sex, col chi2
// dummies
gen is_black = (race == 2) if race < 3
gen is_female = (sex == 2) if !mi(sex)
// proportions test
prtest is_female, by(is_black)
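The dummies above compare the share of females between white and black respondents. For the exact difference quoted in the question (the proportion white among males versus among females), a sketch along the same lines would presumably be:
// proportion white, compared between the two sexes
gen is_white = (race == 1)
prtest is_white, by(sex)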
You can also use the immediate form of prtest, namely prtesti. The downside is that you have to type in the number of observations and the proportion for each group manually. With your example:
prtesti 1702 0.8782 1850 0.8673
Two-sample test of proportions x: Number of obs = 1702
y: Number of obs = 1850
------------------------------------------------------------------------------
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .8782 .0079276 .8626622 .8937378
y | .8673 .0078874 .851841 .882759
-------------+----------------------------------------------------------------
diff | .0109 .0111829 -.0110181 .0328181
| under Ho: .0112015 0.97 0.331
------------------------------------------------------------------------------
diff = prop(x) - prop(y) z = 0.9731
Ho: diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(Z < z) = 0.8347 Pr(|Z| < |z|) = 0.3305 Pr(Z > z) = 0.1653
