How to obtain a confidence interval for the difference in two proportions in Stata - statistics

I would like to obtain a confidence interval for the difference in two proportions.
For example
webuse highschool
tab race sex, col chi2
  1=white, |
  2=black, |  1=male, 2=female
   3=other |      male     female |     Total
-----------+----------------------+----------
     White |     1,702      1,850 |     3,552
           |     87.82      86.73 |     87.25
-----------+----------------------+----------
     Black |       201        249 |       450
           |     10.37      11.67 |     11.05
-----------+----------------------+----------
     Other |        35         34 |        69
           |      1.81       1.59 |      1.69
-----------+----------------------+----------
     Total |     1,938      2,133 |     4,071
           |    100.00     100.00 |    100.00

          Pearson chi2(2) =   1.9652   Pr = 0.374
The difference between the proportions of males and of females who are white is 87.82 - 86.73 = 1.09 percentage points, and I would like a confidence interval for this difference.

The prtest command is what you need: prtest sex, by(race). Each variable must contain exactly two groups, so recode race and sex into 0/1 dummies first:
webuse highschool
tab race sex, col chi2
// dummies
gen is_black = (race == 2) if race < 3
gen is_female = (sex == 2) if !mi(sex)
// proportions test
prtest is_female, by(is_black)

You can use the immediate form of -prtest- instead, that is, -prtesti-.
The downside is that you have to enter the sample sizes and proportions manually.
With your example:
prtesti 1702 0.8782 1850 0.8673
Two-sample test of proportions                     x: Number of obs =     1702
                                                   y: Number of obs =     1850
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |      .8782   .0079276                      .8626622    .8937378
           y |      .8673   .0078874                       .851841     .882759
-------------+----------------------------------------------------------------
        diff |      .0109   .0111829                     -.0110181    .0328181
             |  under Ho:   .0112015     0.97   0.331
------------------------------------------------------------------------------
        diff = prop(x) - prop(y)                                  z =   0.9731
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.8347         Pr(|Z| < |z|) = 0.3305          Pr(Z > z) = 0.1653
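To see where the interval in the diff row comes from, here is a minimal Python sketch of the unpooled (Wald) interval for a difference in two proportions. The function name diff_prop_ci is hypothetical, and the sketch assumes two independent samples. It takes each group's sample size as an argument. If the proportions are the column percentages from the question, the matching sample sizes would be the column totals (1,938 and 2,133) rather than the cell counts, so the call below only reproduces the displayed output.

```python
from math import sqrt
from statistics import NormalDist


def diff_prop_ci(n1, p1, n2, p2, level=0.95):
    """Wald confidence interval for p1 - p2 from two independent samples."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    diff = p1 - p2
    # Unpooled standard error: each group contributes its own variance term.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, se, (diff - z * se, diff + z * se)


# Same four numbers as the prtesti call above.
diff, se, (lower, upper) = diff_prop_ci(1702, 0.8782, 1850, 0.8673)
print(f"diff={diff:.4f} se={se:.7f} ci=({lower:.7f}, {upper:.7f})")
```

The printed values should agree with the diff row of the table up to rounding.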


Using rdrobust to calculate 2sls and getting error message stating "should be set within the range of x"

I am trying to estimate a two-stage least squares model in Stata. My dataset looks like the following:
income bmi health_index asian black q_o_l age aide
100 19 99 1 0 87 23 1
0 21 87 1 0 76 29 0
1002 23 56 0 1 12 47 1
2200 24 67 1 0 73 43 0
2076 21 78 1 0 12 73 1
I am trying to use rdrobust to estimate the treatment effect. I used the following code:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
I varied the income variable with multiple polynomial forms and used multiple bandwidths. I keep getting the same error message stating:
c() should be set within the range of aide
I am assuming that this has to do with the bandwidth. How can I correct it?
You have two issues with the syntax. You wrote:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
This will ignore the variables health_index through age, since you can only have one running variable; additional covariates belong in the covs() option. It will then try to use a cutoff of 10 for aide (the second variable is always the running one). Since aide is binary, 10 lies outside its range and rdrobust complains.
It's not obvious to me what makes sense in your setting, but here's an example demonstrating the problem and the two remedies:
. use "http://fmwww.bc.edu/repec/bocode/r/rdrobust_senate.dta", clear
. rdrobust vote margin, c(0) covs(state year class termshouse termssenate population)
Covariate-adjusted sharp RD estimates using local polynomial regression.
Cutoff c = 0 | Left of c Right of c Number of obs = 1108
-------------------+---------------------- BW type = mserd
Number of obs | 491 617 Kernel = Triangular
Eff. Number of obs | 309 279 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 17.669 17.669
BW bias (b) | 28.587 28.587
rho (h/b) | 0.618 0.618
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | 6.8862 1.3971 4.9291 0.000 4.14804 9.62438
Robust | - - 4.2540 0.000 3.78697 10.258
--------------------------------------------------------------------------------
Covariate-adjusted estimates. Additional covariates included: 6
. sum margin
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
margin | 1,390 7.171159 34.32488 -100 100
. rdrobust vote margin state year class termshouse termssenate population, c(7) // margin rang
> es from -100 to 100
Sharp RD estimates using local polynomial regression.
Cutoff c = 7 | Left of c Right of c Number of obs = 1297
-------------------+---------------------- BW type = mserd
Number of obs | 744 553 Kernel = Triangular
Eff. Number of obs | 334 215 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 14.423 14.423
BW bias (b) | 24.252 24.252
rho (h/b) | 0.595 0.595
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | .1531 1.7487 0.0875 0.930 -3.27434 3.58053
Robust | - - -0.0718 0.943 -4.25518 3.95464
--------------------------------------------------------------------------------
. rdrobust vote margin state year class termshouse termssenate population, c(-100) // nonsensical
> cutoff for margin
c() should be set within the range of margin
r(125);
end of do-file
r(125);
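The error message can be read as a simple range check on the running variable. The sketch below is a hypothetical illustration, not rdrobust's actual code: the helper name cutoff_in_range is made up, and the strict inequality is an assumption inferred from the log above, where c(-100) fails even though -100 is the minimum of margin.

```python
def cutoff_in_range(values, c):
    """Return True if c lies strictly inside the observed range of values."""
    return min(values) < c < max(values)


# aide from the question is binary, so a cutoff of 10 can never be valid.
aide = [1, 0, 1, 0, 1]
print(cutoff_in_range(aide, 10))        # False

# margin runs from -100 to 100 in the example data.
margin_range = [-100, 100]
print(cutoff_in_range(margin_range, 7))     # True
print(cutoff_in_range(margin_range, -100))  # False (assumed strict bound)
```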

Stata help: null hypothesis against the alternative hypothesis

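How can I test a null hypothesis against an alternative hypothesis in Stata? Suppose the hypothesis is H_0: \beta_1 = \beta_2 = 0 against H_A: \beta_1 \neq \beta_2 \neq 0. What would the code be?
This can be done using testparm or test. The example below tests the joint hypothesis that the coefficients on weight and 1.foreign are equal to each other and to 3500; to test that both are zero, list the coefficients without a value (test weight 1.foreign):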
. sysuse auto, clear
(1978 Automobile Data)
. replace weight = weight/1000
variable weight was int now float
(74 real changes made)
. reg price mpg weight i.foreign
      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(3, 70)        =     23.29
       Model |   317252879         3   105750960   Prob > F        =    0.0000
    Residual |   317812517        70  4540178.81   R-squared       =    0.4996
-------------+----------------------------------   Adj R-squared   =    0.4781
       Total |   635065396        73  8699525.97   Root MSE        =    2130.8

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   21.85361   74.22114     0.29   0.769    -126.1758     169.883
      weight |   3464.706    630.749     5.49   0.000     2206.717    4722.695
             |
     foreign |
    Foreign  |    3673.06   683.9783     5.37   0.000     2308.909    5037.212
       _cons |  -5853.696   3376.987    -1.73   0.087    -12588.88    881.4934
------------------------------------------------------------------------------
. test weight=1.foreign=3500
( 1) weight - 1.foreign = 0
( 2) weight = 3500
F( 2, 70) = 0.05
Prob > F = 0.9466
The two-sided p-value is stored in r(p):
. display r(p)
.94664298
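For reference, the F statistic reported by test after regress is the usual Wald statistic for linear restrictions. The block below is a sketch of that formula written for this example; the symbols R, q, J and \hat{V} are notation introduced here, with coefficients ordered as (mpg, weight, 1.foreign, _cons) and \hat{V} the estimated coefficient covariance matrix.

```latex
% Restrictions from "test weight=1.foreign=3500", written as R\beta = q:
%   (1) weight - 1.foreign = 0
%   (2) weight = 3500
R = \begin{pmatrix} 0 & 1 & -1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix},
\qquad
q = \begin{pmatrix} 0 \\ 3500 \end{pmatrix},
\qquad
J = 2

% Wald F statistic, compared with an F(J, 70) distribution under H_0:
F = \frac{(R\hat{\beta} - q)^{\top} \left[ R \hat{V} R^{\top} \right]^{-1} (R\hat{\beta} - q)}{J}
```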

How do I search for a string in a pandas column and append to the row based on that string?

I have a pandas dataframe and I want to search through strings in column A, if there's a match I want to append 1 to a new column, if there is no match I want to append a 0.
My df currently looks like:
Column A | Column B | Column C
company one | 314 | 0.9
company one toast | 190 | 0.3
www.companyone | 380 | 0.87
companyone home | 850 | 0.1
toaster supplies | 1100 | 0.5
toast rack | 200 | 0.7
...
I'm trying to write a function which will read through column A, and if there's a match with either company one or companyone, then append 1 on the end of the row. If there is no match, then append 0. The output I'm looking for is:
Column A | Column B | Column C | Branded
company one | 314 | 0.9 | 1
company one toast | 190 | 0.3 | 1
www.companyone | 380 | 0.87 | 1
companyone home | 850 | 0.1 | 1
toaster supplies | 1100 | 0.5 | 0
toast rack | 200 | 0.7 | 0
...
I've tried this function:
def branded(table):
    if 'company.*?one' in table[table['Column A']]:
        table['Branded'] = 1
    else:
        table['Branded'] = 0
    return table.head()
However I get a KeyError. I'm not sure what I'm missing though.
You can do it like this:
df['Branded'] = df['Column A'].str.contains('company.*?one')*1
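The KeyError in the question arises because table[table['Column A']] uses the values of Column A as column labels, and the in operator does not interpret a string as a regular expression anyway. By default, str.contains treats its argument as a regular expression and searches each element for it. The sketch below mimics that element-wise search with Python's re module only, so it runs without pandas; the list names is simply Column A from the question.

```python
import re

# Same pattern as in the one-liner above: "company", then anything (lazily), then "one".
pattern = re.compile(r"company.*?one")

names = [
    "company one",
    "company one toast",
    "www.companyone",
    "companyone home",
    "toaster supplies",
    "toast rack",
]

# re.search looks for the pattern anywhere in the string, like str.contains.
branded = [int(bool(pattern.search(name))) for name in names]
print(branded)  # [1, 1, 1, 1, 0, 0]
```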
The solution posted by zipa is better in my opinion. However, I thought I would share this variant, which helps when the strings to search for follow entirely different patterns. You can add the search terms to a list and then do something similar:
import pandas as pd
df = pd.DataFrame({'column':['company one','companyone', 'company two']})
search = ['company one', 'companyone']
string_search = '|'.join(search)
df['flag'] = df['column'].str.contains(string_search)
df['flag'] = df['flag'].map({True: 1, False: 0})

Treatment factor variable omitted in stata regression

I'm running a basic difference-in-differences regression model with year and county fixed effects with the following code:
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_born young_population manufacturing low_skill_sector unemployment ln_median_income [weight = mean_population], fe cluster(fips) robust
i.treated is a dichotomous measure of whether or not a county received the treatment at any point during the study, and after_1980 marks the post-treatment period. However, when I run this regression, the estimate for my treatment variable is omitted, so I can't really interpret the results. The output is below. I would love some guidance on what to check so that I can get an estimate for the treated variable prior to treatment.
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_bo
> rn young_population manufacturing low_skill_sector unemployment ln_median_income
> [weight = mean_population], fe cluster(fips) robust
(analytic weights assumed)
note: 1.treated omitted because of collinearity
note: 2000.year omitted because of collinearity
Fixed-effects (within) regression Number of obs = 15,221
Group variable: fips Number of groups = 3,117
R-sq: Obs per group:
within = 0.2269 min = 1
between = 0.1093 avg = 4.9
overall = 0.0649 max = 5
F(12,3116) = 89.46
corr(u_i, Xb) = 0.0502 Prob > F = 0.0000
(Std. Err. adjusted for 3,117 clusters in fips)
---------------------------------------------------------------------------------
| Robust
ln_murder_rate | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
1.treated | 0 (omitted)
1.after_1980 | .2012816 .1105839 1.82 0.069 -.0155431 .4181063
|
treated#|
after_1980 |
1 1 | .0469658 .0857318 0.55 0.584 -.1211307 .2150622
|
year |
1970 | .4026329 .0610974 6.59 0.000 .2828376 .5224282
1980 | .6235034 .0839568 7.43 0.000 .4588872 .7881196
1990 | .4040176 .0525122 7.69 0.000 .3010555 .5069797
2000 | 0 (omitted)
|
ln_deprivation | .3500093 .119083 2.94 0.003 .1165202 .5834983
ln_foreign_born | .0179036 .0616842 0.29 0.772 -.1030421 .1388494
young_populat~n | .0030727 .0081619 0.38 0.707 -.0129306 .0190761
manufacturing | -.0242317 .0073166 -3.31 0.001 -.0385776 -.0098858
low_skill_sec~r | -.0084896 .0088702 -0.96 0.339 -.0258816 .0089025
unemployment | .0335105 .027627 1.21 0.225 -.0206585 .0876796
ln_median_inc~e | -.2423776 .1496396 -1.62 0.105 -.5357799 .0510246
_cons | 2.751071 1.53976 1.79 0.074 -.2679753 5.770118
----------------+----------------------------------------------------------------
sigma_u | .71424066
sigma_e | .62213091
rho | .56859936 (fraction of variance due to u_i)
---------------------------------------------------------------------------------
This is borderline off-topic since this is essentially a statistical question.
The variable treated is dropped because it is time-invariant and you are doing a fixed effects regression, which transforms the data by subtracting the average for each panel for each covariate and outcome. Treated observations all have treated set to one, so when you subtract the average of treated for each panel, which is also one, you get a zero. Similarly for control observations, except they all have treated set to zero. The result is that the treated column is all zeros and Stata drops it because otherwise the matrix is not invertible since there is no variation.
The parameter you care about is treated#after_1980, which is the DID effect and is reported in your output. The fact that treated is dropped is not concerning.
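The within transformation described above can be illustrated with a few lines of Python. This is a minimal sketch on a hypothetical three-period panel with one treated and one control county (all names and values are made up); it ignores weights and the other covariates.

```python
def demean(values):
    """Subtract the panel mean from each observation (the within transformation)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical panel: after switches on in the last period for every county.
after = [0, 0, 1]
treated = {"treated_county": [1, 1, 1], "control_county": [0, 0, 0]}

for county, t in treated.items():
    interaction = [ti * ai for ti, ai in zip(t, after)]
    # treated is constant within a county, so it demeans to all zeros;
    # the interaction still varies over time for the treated county.
    print(county, demean(t), demean(interaction))
```

The demeaned treated column is zero for both counties, which is why it is dropped, while the demeaned interaction keeps variation within the treated county and remains estimable.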

Retrieving Max in range of one column dictated by another column

My setup is fairly simple. I have paired data where one column is time and the next is a value corresponding to that time point. This recurs for many trials, with each trial having a different number of time points.
Time Freq
0.216 0.000
0.423 4.835
0.620 5.067
0.784 6.108
0.971 5.355
1.156 5.395
1.311 6.470
1.433 8.170
1.575 7.034
1.752 5.673
1.925 5.758
2.077 6.602
2.180 9.675
2.363 5.477
2.487 8.022
2.616 7.795
2.773 6.344
2.915 7.050
3.074 6.283
3.208 7.495
3.395 5.344
3.535 7.111
3.682 6.839
3.830 6.730
4.023 5.185
This is an example from a table. What I want to do is create a formula that will pull the max Freq when Time is greater than 1 and less than 3. I know this can be done by manually selecting the range, but I have many different ranges that I want to find the max Freq for, and I would like to be able to just input the column.
You can reference upper and lower bounds for the time variable like this:
+---+----+----+-------+
| | D | E | F |
+---+----+----+-------+
| 1 | LB | UB |MaxFreq|
| 2 | 1 | 3 | 9.675 |
| 3 | 0 | 1 | 6.108 |
| 4 | 1 | 2 | 8.17 |
| 5 | 2 | 3 | 9.675 |
+---+----+----+-------+
F2: =MAX(IF(($A$1:$A$26>$D2)*($A$1:$A$26<$E2),$B$1:$B$26))
F2 is an array formula--confirm the entry with the combination Ctrl+Shift+Enter (not just Enter). It can be copied down as far as needed.
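The logic of the array formula (keep only the Freq values whose Time lies strictly between the bounds, then take the maximum) can be checked outside the spreadsheet. The Python sketch below uses the data from the question; the function name max_freq is hypothetical, and it returns None when no time point falls in the range, whereas the spreadsheet formula would return 0.

```python
time = [0.216, 0.423, 0.620, 0.784, 0.971, 1.156, 1.311, 1.433, 1.575,
        1.752, 1.925, 2.077, 2.180, 2.363, 2.487, 2.616, 2.773, 2.915,
        3.074, 3.208, 3.395, 3.535, 3.682, 3.830, 4.023]
freq = [0.000, 4.835, 5.067, 6.108, 5.355, 5.395, 6.470, 8.170, 7.034,
        5.673, 5.758, 6.602, 9.675, 5.477, 8.022, 7.795, 6.344, 7.050,
        6.283, 7.495, 5.344, 7.111, 6.839, 6.730, 5.185]


def max_freq(lb, ub):
    """Largest freq whose time is strictly between lb and ub, or None if none."""
    return max((f for t, f in zip(time, freq) if lb < t < ub), default=None)


# Same bounds as rows 2 to 5 of the LB/UB table above.
for lb, ub in [(1, 3), (0, 1), (1, 2), (2, 3)]:
    print(lb, ub, max_freq(lb, ub))
```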
