Treatment factor variable omitted in Stata regression

I'm running a basic difference-in-differences regression model with year and county fixed effects with the following code:
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_born young_population manufacturing low_skill_sector unemployment ln_median_income [weight = mean_population], fe cluster(fips) robust
i.treated is a dichotomous indicator of whether a county ever received the treatment over the lifetime of the study, and after_1980 marks the post-treatment period. However, when I run this regression, the estimate for my treatment variable is omitted, so I can't really interpret the results. The output is below. I would love some guidance on what to check so that I can get an estimate for the treated variable prior to treatment.
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_bo
> rn young_population manufacturing low_skill_sector unemployment ln_median_income
> [weight = mean_population], fe cluster(fips) robust
(analytic weights assumed)
note: 1.treated omitted because of collinearity
note: 2000.year omitted because of collinearity
Fixed-effects (within) regression Number of obs = 15,221
Group variable: fips Number of groups = 3,117
R-sq: Obs per group:
within = 0.2269 min = 1
between = 0.1093 avg = 4.9
overall = 0.0649 max = 5
F(12,3116) = 89.46
corr(u_i, Xb) = 0.0502 Prob > F = 0.0000
(Std. Err. adjusted for 3,117 clusters in fips)
---------------------------------------------------------------------------------
| Robust
ln_murder_rate | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
1.treated | 0 (omitted)
1.after_1980 | .2012816 .1105839 1.82 0.069 -.0155431 .4181063
|
treated#|
after_1980 |
1 1 | .0469658 .0857318 0.55 0.584 -.1211307 .2150622
|
year |
1970 | .4026329 .0610974 6.59 0.000 .2828376 .5224282
1980 | .6235034 .0839568 7.43 0.000 .4588872 .7881196
1990 | .4040176 .0525122 7.69 0.000 .3010555 .5069797
2000 | 0 (omitted)
|
ln_deprivation | .3500093 .119083 2.94 0.003 .1165202 .5834983
ln_foreign_born | .0179036 .0616842 0.29 0.772 -.1030421 .1388494
young_populat~n | .0030727 .0081619 0.38 0.707 -.0129306 .0190761
manufacturing | -.0242317 .0073166 -3.31 0.001 -.0385776 -.0098858
low_skill_sec~r | -.0084896 .0088702 -0.96 0.339 -.0258816 .0089025
unemployment | .0335105 .027627 1.21 0.225 -.0206585 .0876796
ln_median_inc~e | -.2423776 .1496396 -1.62 0.105 -.5357799 .0510246
_cons | 2.751071 1.53976 1.79 0.074 -.2679753 5.770118
----------------+----------------------------------------------------------------
sigma_u | .71424066
sigma_e | .62213091
rho | .56859936 (fraction of variance due to u_i)
---------------------------------------------------------------------------------

This is borderline off-topic since this is essentially a statistical question.
The variable treated is dropped because it is time-invariant and you are running a fixed-effects regression, which transforms the data by subtracting each panel's mean from every covariate and from the outcome. For treated counties, treated is always one, so subtracting the panel mean (also one) leaves zero; the same happens for control counties, whose treated values are all zero. The transformed treated column is therefore all zeros, and Stata drops it: with no variation in that column the design matrix would not be invertible.
The parameter you care about is treated#after_1980, which is the DID effect and is reported in your output. The fact that treated is dropped is not concerning.
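If you want to see the mechanics for yourself, here is a minimal sketch with simulated two-period panel data (made-up variables, not your county data): the time-invariant dummy is absorbed by the within transformation, while its interaction with the post indicator is still estimated.
* simulate a panel where treated never changes within a unit
clear
set seed 2024
set obs 200
gen id = _n
gen treated = runiform() < 0.5
expand 2
bysort id: gen after = _n - 1
gen y = 0.5*treated + 0.3*after + 0.4*treated*after + rnormal()
xtset id after
* 1.treated is omitted (collinear with the unit effects); treated#after is the DID term
xtreg y i.treated##i.after, fe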

Related

Using rdrobust to calculate 2sls and getting error message stating "should be set within the range of x"

I am trying to calculate a two stage least squares in Stata. My dataset looks like the following:
income bmi health_index asian black q_o_l age aide
100 19 99 1 0 87 23 1
0 21 87 1 0 76 29 0
1002 23 56 0 1 12 47 1
2200 24 67 1 0 73 43 0
2076 21 78 1 0 12 73 1
I am trying to use rdrobust to estimate the treatment effect. I used the following code:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
I varied the income variable with multiple polynomial forms and used multiple bandwidths. I keep getting the same error message stating:
c() should be set within the range of aide
I am assuming that this has to do with the bandwidth. How can I correct it?
You have two issues with the syntax. You wrote:
rdrobust q_o_l aide health_index bmi income asian black age, c(10)
This will ignore the variables health_index through age, since you can only have one running variable. It will then try to use a cutoff of 10 for aide (the second variable is always the running one). Since aide is binary, a cutoff of 10 lies outside its range, so Stata complains.
It's not obvious to me what makes sense in your setting, but here's an example demonstrating the problem and the two remedies:
. use "http://fmwww.bc.edu/repec/bocode/r/rdrobust_senate.dta", clear
. rdrobust vote margin, c(0) covs(state year class termshouse termssenate population)
Covariate-adjusted sharp RD estimates using local polynomial regression.
Cutoff c = 0 | Left of c Right of c Number of obs = 1108
-------------------+---------------------- BW type = mserd
Number of obs | 491 617 Kernel = Triangular
Eff. Number of obs | 309 279 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 17.669 17.669
BW bias (b) | 28.587 28.587
rho (h/b) | 0.618 0.618
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | 6.8862 1.3971 4.9291 0.000 4.14804 9.62438
Robust | - - 4.2540 0.000 3.78697 10.258
--------------------------------------------------------------------------------
Covariate-adjusted estimates. Additional covariates included: 6
. sum margin
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
margin | 1,390 7.171159 34.32488 -100 100
. rdrobust vote margin state year class termshouse termssenate population, c(7) // margin rang
> es from -100 to 100
Sharp RD estimates using local polynomial regression.
Cutoff c = 7 | Left of c Right of c Number of obs = 1297
-------------------+---------------------- BW type = mserd
Number of obs | 744 553 Kernel = Triangular
Eff. Number of obs | 334 215 VCE method = NN
Order est. (p) | 1 1
Order bias (q) | 2 2
BW est. (h) | 14.423 14.423
BW bias (b) | 24.252 24.252
rho (h/b) | 0.595 0.595
Outcome: vote. Running variable: margin.
--------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+------------------------------------------------------------
Conventional | .1531 1.7487 0.0875 0.930 -3.27434 3.58053
Robust | - - -0.0718 0.943 -4.25518 3.95464
--------------------------------------------------------------------------------
. rdrobust vote margin state year class termshouse termssenate population, c(-100) // nonsensical
> cutoff for margin
c() should be set within the range of margin
r(125);
end of do-file
r(125);
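Translated back to your setting, and assuming health_index is the intended running variable (only you know whether that design makes sense; the cutoff of 70 below is just a placeholder inside the values visible in your snippet), the corrected syntax would look something like:
* hypothetical syntax: one running variable, other controls passed through covs()
summarize health_index
rdrobust q_o_l health_index, c(70) covs(bmi income asian black age)
* if aide marks actual treatment receipt in a fuzzy design, it would go in fuzzy():
* rdrobust q_o_l health_index, c(70) fuzzy(aide) covs(bmi income asian black age)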
You might also find this answer interesting.

Stata help: null hypothesis against the alternative hypothesis

How can I test a null hypothesis against an alternative hypothesis in Stata? For example, if I have H_0: \beta_1 = \beta_2 = 0 against H_A: at least one of \beta_1, \beta_2 is nonzero, what would the code be?
This can be done using testparm or test:
. sysuse auto, clear
(1978 Automobile Data)
. replace weight = weight/1000
variable weight was int now float
(74 real changes made)
. reg price mpg weight i.foreign
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(3, 70) = 23.29
Model | 317252879 3 105750960 Prob > F = 0.0000
Residual | 317812517 70 4540178.81 R-squared = 0.4996
-------------+---------------------------------- Adj R-squared = 0.4781
Total | 635065396 73 8699525.97 Root MSE = 2130.8
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg | 21.85361 74.22114 0.29 0.769 -126.1758 169.883
weight | 3464.706 630.749 5.49 0.000 2206.717 4722.695
|
foreign |
Foreign | 3673.06 683.9783 5.37 0.000 2308.909 5037.212
_cons | -5853.696 3376.987 -1.73 0.087 -12588.88 881.4934
------------------------------------------------------------------------------
. test weight=1.foreign=3500
( 1) weight - 1.foreign = 0
( 2) weight = 3500
F( 2, 70) = 0.05
Prob > F = 0.9466
The two-sided p-value is stored in r(p):
. display r(p)
.94664298
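For the joint null in the question itself, H_0: \beta_1 = \beta_2 = 0 against the alternative that at least one coefficient is nonzero, the analogous commands on the same auto example would be (a sketch, using the mpg and weight coefficients):
* joint test that the coefficients on mpg and weight are both zero
testparm mpg weight
* equivalent syntax
test mpg weight
display r(p)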

Average of ratios ("on the fly") in Excel table

I have this Excel table, used as a DB, named "csv":
Ticket agent_wait client_wait
1 200 105
2 10 50
3 172 324
I'd like to calculate the average of the agent wait ratios, where ratio_agent is calculated as agent_wait / (agent_wait + client_wait).
If the table were like this:
Ticket agent_wait client_wait ratio_agent
1 200 105 0.65
2 10 50 0.16
3 172 324 0.35
I'd just do the average of the ratio_agent column with =AVERAGE(csv[ratio_agent]).
The problem is that this last column does not exist and I don't want to create an additional column just for this calculation.
Is there a way to do this with only a formula?
I already tried
=AVERAGE(csv[agent_wait]/(csv[agent_wait]+csv[client_wait])) but it gives me the answer for only one line.
You can use the formula you already have, but you need to enter it as an array formula: after typing the formula, do not just press Enter; hold Ctrl+Shift and then press Enter. The resulting formula will turn into this:
{=AVERAGE(csv[agent_wait]/(csv[agent_wait]+csv[client_wait]))}
and it will give you the value you are looking for. Swap the first csv[agent_wait] for csv[client_wait] if you are looking for the average client_wait ratio instead.
Your question might also be an XY problem; please take a look at this answer. It might help you decide what you are actually looking for.
In brief, if you want a measure of how much time:
agents spend waiting out of all the waiting between agents and clients, calculate the totals first and take the ratio of those totals. Outliers (e.g. a case where an agent spent far more time waiting on one client than the client did) will heavily affect this measure. Use it if you want to know how much time agents spend waiting as opposed to how much clients wait.
=SUM(csv[agent_wait])/(SUM(csv[agent_wait])+SUM(csv[client_wait]))
agents each spend waiting on any particular call, calculate the ratios first and then average them. Outliers will not affect this measure by much; it gives the expected share of waiting an agent contributes in any one interaction with a client. Use it if you want a guideline for how much time an agent should spend waiting for each unit of time a client spends waiting.
=AVERAGE(csv[agent_wait]/(csv[agent_wait]+csv[client_wait]))
It also wouldn't be correct to do the =AVERAGE(csv[ratio_agent]) calculation. An average of per-row ratios isn't the overall ratio. You need to sum the parts and then compute the overall ratio from those totals.
Ticket | agent_wait | client_wait | ratio_agent
------ | ---------- | ----------- | -----------
1 | 200 | 105 | 0.656
2 | 10 | 50 | 0.167
3 | 172 | 324 | 0.347
Total | 382 | 479 | ?????
The question is what goes in for the ?????.
If you take the average of the ratio_agent column (i.e. =AVERAGE(csv[ratio_agent])) then you get 0.390.
But if you compute the ratio again, but with the column totals, like =csv[[#Totals],[agent_wait]]/(csv[[#Totals],[agent_wait]]+csv[[#Totals],[client_wait]]), then you get the true answer: 0.444.
To see how this is true try this set of data:
Ticket | agent_wait | client_wait | ratio_agent
------ | ---------- | ----------- | -----------
1 | 2000 | 2000 | 0.500
2 | 10 | 1 | 0.909
Total | 2010 | 2001 |
The average of the two ratios is 0.705, but it should be clear that if the total agent wait was 2010 and the total client wait was 2001 then the true average ratio must be closer to 0.500.
Computing it using the correct calculation you get 0.501.
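Formally, the ratio built from the totals is a weighted average of the per-ticket ratios, with each ticket weighted by its share of the total waiting time:
\frac{\sum_i a_i}{\sum_i (a_i + c_i)} = \sum_i w_i \cdot \frac{a_i}{a_i + c_i}, where w_i = \frac{a_i + c_i}{\sum_j (a_j + c_j)}
and a_i and c_i are agent_wait and client_wait for ticket i. Plain AVERAGE uses equal weights 1/n instead, which is why the two numbers (0.390 versus 0.444 above) differ whenever tickets have different total waits.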

Scaling values with a known upper limit

I have a column of values in Excel that I need to modify by a scale factor. Original column example:
| Value |
|:-----:|
| 75 |
| 25 |
| 25 |
| 50 |
| 0 |
| 0 |
| 100 |
Scale factor: 1.5
| Value |
|:-----:|
| 112.5 |
| 37.5 |
| 37.5 |
| 75 |
| 0 |
| 0 |
| 150 |
The problem is I need them to stay within a range of 0-100. My first thought was to take them as percentages of 100, but I quickly realized that this would just be going in circles.
Is there a mathematical method or Excel formula I could use so that the changes remain meaningful, i.e. after modification 150 becomes 100 but 37.5 does not simply map back to 25, rather than just cancelling out my scale factor?
Assuming your data begin in cell A1, you can use this formula:
=MIN(100,A1*1.5)
Copy downward as needed.
You could do something like:
ScaledValue = (v - MIN(AllValues)) / (MAX(AllValues) - MIN(AllValues)) * (SCALE_MAX - SCALE_MIN) + SCALE_MIN
Say your raw data (a.k.a. AllValues) ranges from a MIN of 15 to a MAX of 83, and you want to scale it to a range of 0 to 100. To do that you would set SCALE_MIN = 0 and SCALE_MAX = 100. In the above equation, v is any single value in the data.
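For instance, taking a hypothetical raw value of v = 49 from that 15-to-83 range:
ScaledValue = (49 - 15) / (83 - 15) * (100 - 0) + 0 = 34/68 * 100 = 50
so a value sitting at the middle of the raw range lands at the middle of the 0-100 scale.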
Hope that helps
Another option is:
ScaledValue = PERCENTRANK.INC(AllValues, v)
In contrast to my earlier suggestion (which is linear and preserves the relative spacing of the data points), this preserves the order of the data but not the spacing. Using PERCENTRANK.INC will have the effect that sparse data get compressed closer together, and bunched data get spread out.
You could also do a weighted combination of the two methods: give the linear method a weight of, say, 0.5 so that relative spacing is partially preserved.
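Written out, with a hypothetical weight of 0.5 on each method:
ScaledValue = 0.5 * [(v - MIN(AllValues)) / (MAX(AllValues) - MIN(AllValues)) * 100] + 0.5 * [PERCENTRANK.INC(AllValues, v) * 100]
The factor of 100 on the percent rank is needed because PERCENTRANK.INC returns a fraction between 0 and 1, while the linear term here is already on the 0-100 scale.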

Stata tabstat change order/sort?

I am using tabstat in Stata, and using estpost and esttab to get its output to LaTeX. I have
tabstat
to display statistics by group. For example,
tabstat assets, by(industry) missing statistics(count mean sd p25 p50 p75)
The question I have is whether there is a way for tabstat (or other Stata commands) to display the output ordered by the value of the mean, so that those categories that have higher means will be on top. By default, Stata displays by alphabetical order of industry when I use tabstat.
tabstat does not offer such a hook, but there is an approach to problems like this that is general and quite easy to understand.
You don't provide a reproducible example, so we need one:
. sysuse auto, clear
(1978 Automobile Data)
. gen Make = word(make, 1)
. tab Make if foreign
Make | Freq. Percent Cum.
------------+-----------------------------------
Audi | 2 9.09 9.09
BMW | 1 4.55 13.64
Datsun | 4 18.18 31.82
Fiat | 1 4.55 36.36
Honda | 2 9.09 45.45
Mazda | 1 4.55 50.00
Peugeot | 1 4.55 54.55
Renault | 1 4.55 59.09
Subaru | 1 4.55 63.64
Toyota | 3 13.64 77.27
VW | 4 18.18 95.45
Volvo | 1 4.55 100.00
------------+-----------------------------------
Total | 22 100.00
Make here is like your variable industry: it is a string variable, so in tables Stata will tend to show it in alphabetical (alphanumeric) order.
The work-around has several easy steps, some optional.
Calculate a variable on which you want to sort. egen is often useful here.
. egen mean_mpg = mean(mpg), by(Make)
Map those values to a variable with distinct integer values. As two groups could have the same mean (or other summary statistic), make sure you break ties on the original string variable.
. egen group = group(mean_mpg Make)
This variable is created to have value 1 for the group with the lowest mean (or other summary statistic), 2 for the next lowest, and so forth. If the opposite order is desired, as in this question, flip the grouping variable around.
. replace group = -group
(74 real changes made)
There is a problem with this new variable: the values of the original string variable, here Make, are nowhere to be seen. labmask (to be installed from the Stata Journal website after search labmask) is a helper here. We use the values of the original string variable as the value labels of the new variable. (The idea is that the value labels become the "mask" that the integer variable wears.)
. labmask group, values(Make)
Optionally, set the variable label of the new integer variable.
. label var group "Make"
Now we can tabulate using the categories of the new variable.
. tabstat mpg if foreign, s(mean) by(group) format(%2.1f)
Summary for variables: mpg
by categories of: group (Make)
group | mean
--------+----------
Subaru | 35.0
Mazda | 30.0
VW | 28.5
Honda | 26.5
Renault | 26.0
Datsun | 25.8
BMW | 25.0
Toyota | 22.3
Fiat | 21.0
Audi | 20.0
Volvo | 17.0
Peugeot | 14.0
--------+----------
Total | 24.8
-------------------
Note: other strategies are sometimes better or as good here.
If you collapse your data to a new dataset, you can then sort it as you please; a minimal sketch follows below.
graph bar and graph dot are good at displaying summary statistics over groups, and the sort order can be tuned directly.
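As a minimal sketch of the collapse route, using the assets and industry variables from the question and working on a temporary copy of the data:
* build a summary dataset and sort it with the highest mean on top
preserve
collapse (count) n=assets (mean) mean=assets (sd) sd=assets (p25) p25=assets (p50) p50=assets (p75) p75=assets, by(industry)
gsort -mean
list industry n mean sd p25 p50 p75, noobs
restore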
UPDATE (3 and 5 October 2021): A new helper command, myaxis, from SSC and the Stata Journal condenses the example here with tabstat:
* set up data example
sysuse auto, clear
gen Make = word(make, 1)
* sort order variable and tabulation
myaxis Make2 = Make, sort(mean mpg) descending
tabstat mpg if foreign, s(mean) by(Make2) format(%2.1f)
I would look at the egenmore package on SSC. You can get that package by typing in Stata ssc install egenmore. In particular, I would look at the entry for axis() in the helpfile of egenmore. That contains an example that does exactly what you want.
