Generating Samples from a Distribution - statistics

I am in the process of learning about statistics, and let's say I have the following outcome probabilities from some experiment:
1 | 0.34
2 | 0.10
3 | 0.05
4 | 0.13
5 | 0.13
6 | 0.25
I am interested in generating samples from this distribution using a uniform random number generator. Any suggestions?

This is a very standard problem with a very standard solution. Form an array where each entry contains not the probability of that index, but the cumulative sum of all probabilities up to and including that index. For your example, the array is p[1] = 0.34, p[2] = 0.44, p[3] = 0.49, and so on up to p[6] = 1.00. Use your uniform RNG to generate u between 0 and 1, then find the index i such that p[i-1] < u <= p[i] (taking p[0] = 0). For a very small array like this you can use linear search, but for a large array you will want binary search. Notice that you can re-use the array to generate multiple deviates, so don't rebuild it every time.
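A minimal sketch of that recipe, written in R only because R appears later in this thread (the names p, cum, u, x are mine); findInterval does the lookup against the cumulative sums in one vectorized call:

p <- c(0.34, 0.10, 0.05, 0.13, 0.13, 0.25)   # probabilities of outcomes 1..6
cum <- cumsum(p)                             # 0.34, 0.44, 0.49, 0.62, 0.75, 1.00
u <- runif(10000)                            # uniform deviates on (0, 1)
x <- findInterval(u, cum) + 1                # index i with cum[i-1] <= u < cum[i]
table(x) / length(x)                         # empirical frequencies, close to p

Note that cum is built once and re-used for all 10,000 draws, as recommended above.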

Related

Histogram bars with different bar colors

I would like to get a histogram with alternating gradients of a given color, according to decile breakpoints, as shown in the figure below:
Example data:
clear
input float dn3001 double hw0010
1219000 2823.89408574376
-16390 520.112200750285
121010 238.732322261911
953839 221.316063150235
465000 247.280750487467
-870 280.305382323347
96000 2946.16661611018
69500 355.33497718705
113000 1421.43087298696
30500 616.914514202173
20000 3389.34765405599
154000 305.674687642557
440500 525.694777777734
56870 1823.24691219821
330500 376.651172915574
101000 465.098273950744
401046.5 660.816203440777
31872 1693.02190101773
220345 603.326244510505
193360 677.527413164373
196300 568.436679602066
222640 427.051692314575
510500 318.557431587468
131450 1388.72862441839
122300 532.996690473983
305 2441.72289873923
313500 292.610321722557
184500 2699.67735757755
1615564.6 386.944439319246
126528 3018.77523617479
711110 511.604491869939
127440 256.968118266053
424900 1620.366555701
95491 3097.46262561529
287500 413.119620218929
70050 2119.47171174278
75460 299.232446656805
210500 290.391474820414
135800 292.141670444933
119924 303.953183619671
81075 1568.41438245214
152 289.175871985445
73000 2551.12752046544
246500 327.474430367518
159960 2350.26463245568
14522 456.56909870547
139000 319.451311193507
68661 2771.34087931684
214089.7 388.589383036063
927800 849.088069585408
7840 1512.71702946577
140140 852.940547469624
21646.566 2405.47949923772
end
The code below produces a graph with uneven bar spread:
xtile aux = dn3001 [aw=hw0010], nq(10)
_pctile dn3001[aw=hw0010], nq(10)
sort dn3001
list dn3001 aux
return list
scalar p10=r(r1)
scalar p20=r(r2)
scalar p30=r(r3)
scalar p40=r(r4)
scalar p50=r(r5)
scalar p60=r(r6)
scalar p70=r(r7)
scalar p80=r(r8)
scalar p90=r(r9)
drop aux
sum dn3001 [aw=hw0010], d
scalar p1=r(p1)
scalar p95=r(p95)
twoway histogram dn3001 if dn3001>=scalar(p1) & dn3001<scalar(p10), bcolor(green%20) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p10) & dn3001<scalar(p20), bcolor(green) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p20) & dn3001<scalar(p30), bcolor(green%20) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p30) & dn3001<scalar(p40), bcolor(green) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p40) & dn3001<scalar(p50), bcolor(green%20) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p50) & dn3001<scalar(p60), bcolor(green) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p60) & dn3001<scalar(p70), bcolor(green%20) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p70) & dn3001<scalar(p80), bcolor(green) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p80) & dn3001<scalar(p90), bcolor(green%20) freq legend(off) ///
|| histogram dn3001 if dn3001>=scalar(p90) & dn3001<scalar(p95), bcolor(green) freq legend(off)
How can I get the same bar width?
Here is one potential approach:
twoway__histogram_gen dn3001, freq bin(50) generate(b a, replace)
_pctile dn3001 [aw=hw0010], nq(10)
return list
scalars:
r(r1) = 20000
r(r2) = 30500
r(r3) = 68661
r(r4) = 75460
r(r5) = 96000
r(r6) = 126528
r(r7) = 159960
r(r8) = 196300
r(r9) = 440500
generate group = .
forvalues i = 9(-1)1 {
replace group = `i' if a <= `r(r`i')'
}
replace group = 10 if a > `r(r9)' & _n <= 20
list a b group in 1 / 20, sepby(group)
+-----------------------+
| a b group |
|-----------------------|
1. | -70.45375 6 1 |
|-----------------------|
2. | 32568.64 4 3 |
3. | 65207.73 7 3 |
|-----------------------|
4. | 97846.82 4 6 |
|-----------------------|
5. | 130485.9 9 7 |
|-----------------------|
6. | 163125 2 8 |
7. | 195764.1 4 8 |
|-----------------------|
8. | 228403.2 3 9 |
9. | 261042.3 1 9 |
10. | 293681.4 1 9 |
11. | 326320.5 2 9 |
12. | 391598.7 1 9 |
13. | 424237.8 2 9 |
|-----------------------|
14. | 456876.8 1 10 |
15. | 522155 1 10 |
16. | 717989.6 1 10 |
17. | 913824.1 1 10 |
18. | 946463.3 1 10 |
19. | 1207576 1 10 |
20. | 1599245 1 10 |
+-----------------------+
Result:
twoway (bar b a, barwidth(25000) legend(off)) ///
(bar b a if group == 3, barwidth(25000) color(green)) ///
(bar b a if group == 9, barwidth(25000) color(red))
More a comment (or a series of comments) than the answer you seek, but the graph won't fit in a comment.
Your approach looks doomed -- if not to failure, then to extreme difficulty.
There is no guarantee whatsoever that any of your quantile bin limits will match any of the histogram bin limits.
Similarly, there is no guarantee that the difference between adjacent quantiles is a simple multiple of any histogram bin width you might choose. You might be tempted to fudge this by colouring a bar according to whichever quantile bin is more frequent within it, but that would ignore the details. Suppose your histogram bar covers [100, 200) but some values in that interval belong to one quantile bin and some to another: what would you do? And what would you do if three or more quantile bins fell within a single histogram bar?
By specifying multiple histograms without specifying starts or bin widths you are unleashing anarchy. Stata will make separate decisions for each histogram based partly on sample sizes. That's what your code is telling it to do, but not what you want.
Your histograms don't know anything about the analytic weights you used.
Beyond that, your question raises all sorts of unnecessary puzzles.
Why produce aux and do nothing with it? It's a point of standard art on SO to show the minimum code necessary to explain your problem.
You say you are interested in deciles, but inconsistently you are also working with the 1st and 95th percentiles.
Why you have such irregular values with very different weights is unclear, but that is inessential for your immediate question. All of this inclines me to think that you cannot get a histogram like your example graph easily or effectively from your data. You have just 53 data points, so regardless of the weights you cannot have more than 53 non-empty bins.
How the bin limits fall relative to the data can be shown directly without a histogram.
With your example data (thanks!) I do this
xtile aux = dn3001 [aw=hw0010], nq(10)
quantile dn3001, ms(none) mla(aux) mlabpos(0) scheme(s1color) rlopts(lc(none))
I would use a logarithmic scale ordinarily but negative values rule that out.
Here I go beyond strict programming issues, but the question inevitably raises the issues I address.

Treatment factor variable omitted in Stata regression

I'm running a basic difference-in-differences regression model with year and county fixed effects with the following code:
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_born young_population manufacturing low_skill_sector unemployment ln_median_income [weight = mean_population], fe cluster(fips) robust
i.treated is a dichotomous measure of whether or not a county received the treatment over the lifetime of the study, and after_1980 marks the post-treatment period. However, when I run this regression, the estimate for my treatment variable is omitted, so I can't really interpret the results. The output is below. I would love some guidance on what to check so that I can get an estimate for the treated variable prior to treatment.
xtreg ln_murder_rate i.treated##i.after_1980 i.year ln_deprivation ln_foreign_bo
> rn young_population manufacturing low_skill_sector unemployment ln_median_income
> [weight = mean_population], fe cluster(fips) robust
(analytic weights assumed)
note: 1.treated omitted because of collinearity
note: 2000.year omitted because of collinearity
Fixed-effects (within) regression Number of obs = 15,221
Group variable: fips Number of groups = 3,117
R-sq: Obs per group:
within = 0.2269 min = 1
between = 0.1093 avg = 4.9
overall = 0.0649 max = 5
F(12,3116) = 89.46
corr(u_i, Xb) = 0.0502 Prob > F = 0.0000
(Std. Err. adjusted for 3,117 clusters in fips)
---------------------------------------------------------------------------------
| Robust
ln_murder_rate | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
1.treated | 0 (omitted)
1.after_1980 | .2012816 .1105839 1.82 0.069 -.0155431 .4181063
|
treated#|
after_1980 |
1 1 | .0469658 .0857318 0.55 0.584 -.1211307 .2150622
|
year |
1970 | .4026329 .0610974 6.59 0.000 .2828376 .5224282
1980 | .6235034 .0839568 7.43 0.000 .4588872 .7881196
1990 | .4040176 .0525122 7.69 0.000 .3010555 .5069797
2000 | 0 (omitted)
|
ln_deprivation | .3500093 .119083 2.94 0.003 .1165202 .5834983
ln_foreign_born | .0179036 .0616842 0.29 0.772 -.1030421 .1388494
young_populat~n | .0030727 .0081619 0.38 0.707 -.0129306 .0190761
manufacturing | -.0242317 .0073166 -3.31 0.001 -.0385776 -.0098858
low_skill_sec~r | -.0084896 .0088702 -0.96 0.339 -.0258816 .0089025
unemployment | .0335105 .027627 1.21 0.225 -.0206585 .0876796
ln_median_inc~e | -.2423776 .1496396 -1.62 0.105 -.5357799 .0510246
_cons | 2.751071 1.53976 1.79 0.074 -.2679753 5.770118
----------------+----------------------------------------------------------------
sigma_u | .71424066
sigma_e | .62213091
rho | .56859936 (fraction of variance due to u_i)
---------------------------------------------------------------------------------
This is borderline off-topic since this is essentially a statistical question.
The variable treated is dropped because it is time-invariant and you are running a fixed effects regression, which transforms the data by subtracting each panel's average from every covariate and from the outcome. Treated observations all have treated set to one, so when you subtract the panel average of treated, which is also one, you get zero; similarly for control observations, except that they all have treated set to zero. The result is that the transformed treated column is all zeros, and Stata drops it because the column has no variation and would otherwise make the design matrix non-invertible.
The parameter you care about is treated#after_1980, which is the DID effect and is reported in your output. The fact that treated is dropped is not concerning.
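As a quick illustration of that within transformation, here is a toy panel in R (made-up data, not the asker's): a dummy that is constant within each panel demeans to exactly zero.

# toy panel: 4 counties (fips) observed for 3 years each, illustrative only
df <- data.frame(
  fips    = rep(1:4, each = 3),
  year    = rep(1:3, times = 4),
  treated = rep(c(1, 1, 0, 0), each = 3)   # constant within each county
)
# the within (fixed effects) transformation subtracts each county's mean
df$treated_within <- df$treated - ave(df$treated, df$fips)
unique(df$treated_within)                  # 0: no within-county variation, so the column is dropped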

Excel: Sum cells if they share an identical unknown string

I have 154,901 rows of data that look like this:
Text String | 340
Where "Text String" represents a variable string that has no other pattern or order to it and cannot be predicted in any mathematical way, and 340 represents a random integer. How can I find the sum of all of the values sharing an identical string, and organize this data based on total per unique string?
For example, say I have the dataset
Alpha | 3
Alpha | 6
Beta | 4
Gamma | 1
Gamma | 3
Gamma | 8
Omega | 10
I'm looking for some way to present the data as:
Alpha | 9
Beta | 4
Gamma | 12
Omega | 10
The point of this being that I have a dataset so large that I cannot enumerate this manually, and a finite yet unknown number of strings whose values I cannot reliably predict.
Consider using a pivot table, and then aggregate the numbers by string. This is probably the least ugly option. – Tim Biegeleisen
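If taking the data out of Excel is ever an option, the same group-and-sum operation is a one-liner in base R, shown here only to illustrate the aggregation a pivot table performs (the data frame and column names are made up):

# df has columns 'label' (the text string) and 'value' (the integer)
totals <- aggregate(value ~ label, data = df, FUN = sum)
totals[order(-totals$value), ]   # one row per unique string, sorted by total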

Stata tabstat change order/sort?

I am using tabstat in Stata, and using estpost and esttab to get its output to LaTeX. I use tabstat to display statistics by group. For example,
tabstat assets, by(industry) missing statistics(count mean sd p25 p50 p75)
The question I have is whether there is a way for tabstat (or other Stata commands) to display the output ordered by the value of the mean, so that those categories that have higher means will be on top. By default, Stata displays by alphabetical order of industry when I use tabstat.
tabstat does not offer such a hook, but there is an approach to problems like this that is general and quite easy to understand.
You don't provide a reproducible example, so we need one:
. sysuse auto, clear
(1978 Automobile Data)
. gen Make = word(make, 1)
. tab Make if foreign
Make | Freq. Percent Cum.
------------+-----------------------------------
Audi | 2 9.09 9.09
BMW | 1 4.55 13.64
Datsun | 4 18.18 31.82
Fiat | 1 4.55 36.36
Honda | 2 9.09 45.45
Mazda | 1 4.55 50.00
Peugeot | 1 4.55 54.55
Renault | 1 4.55 59.09
Subaru | 1 4.55 63.64
Toyota | 3 13.64 77.27
VW | 4 18.18 95.45
Volvo | 1 4.55 100.00
------------+-----------------------------------
Total | 22 100.00
Make here is like your variable industry: it is a string variable, so in tables Stata will tend to show it in alphabetical (alphanumeric) order.
The work-around has several easy steps, some optional.
Calculate a variable on which you want to sort. egen is often useful here.
. egen mean_mpg = mean(mpg), by(Make)
Map those values to a variable with distinct integer values. As two groups could have the same mean (or other summary statistic), make sure you break ties on the original string variable.
. egen group = group(mean_mpg Make)
This variable is created to have value 1 for the group with the lowest mean (or other summary statistic), 2 for the next lowest, and so forth. If the opposite order is desired, as in this question, flip the grouping variable around.
. replace group = -group
(74 real changes made)
There is a problem with this new variable: the values of the original string variable, here Make, are nowhere to be seen. labmask (to be installed from the Stata Journal website after search labmask) is a helper here. We use the values of the original string variable as the value labels of the new variable. (The idea is that the value labels become the "mask" that the integer variable wears.)
. labmask group, values(Make)
Optionally, set the variable label of the new integer variable.
. label var group "Make"
Now we can tabulate using the categories of the new variable.
. tabstat mpg if foreign, s(mean) by(group) format(%2.1f)
Summary for variables: mpg
by categories of: group (Make)
group | mean
--------+----------
Subaru | 35.0
Mazda | 30.0
VW | 28.5
Honda | 26.5
Renault | 26.0
Datsun | 25.8
BMW | 25.0
Toyota | 22.3
Fiat | 21.0
Audi | 20.0
Volvo | 17.0
Peugeot | 14.0
--------+----------
Total | 24.8
-------------------
Note: other strategies are sometimes better or as good here.
If you collapse your data to a new dataset, you can then sort it as you please.
graph bar and graph dot are good at displaying summary statistics over groups, and the sort order can be tuned directly.
UPDATE 3 and 5 October 2021: A new helper command, myaxis, from SSC and the Stata Journal condenses the example here with tabstat:
* set up data example
sysuse auto, clear
gen Make = word(make, 1)
* sort order variable and tabulation
myaxis Make2 = Make, sort(mean mpg) descending
tabstat mpg if foreign, s(mean) by(Make2) format(%2.1f)
I would look at the egenmore package on SSC. You can get that package by typing ssc install egenmore in Stata. In particular, I would look at the entry for axis() in the help file of egenmore. That contains an example that does exactly what you want.

Referring to objects using variable strings in R

Edit: Thanks to those who have responded so far; I'm very much a beginner in R and have just taken on a large project for my MSc dissertation, so I am a bit overwhelmed with the initial processing. The data I'm using is as follows (from WMO publicly available rainfall data):
120 6272100 KHARTOUM 15.60 32.55 382 1899 1989 0.0
1899 0.03 0.03 0.03 0.03 0.03 1.03 13.03 12.03 9999 6.03 0.03 0.03
1900 0.03 0.03 0.03 0.03 0.03 23.03 80.03 47.03 23.03 8.03 0.03 0.03
1901 0.03 0.03 0.03 0.03 0.03 17.03 23.03 17.03 0.03 8.03 0.03 0.03
(...)
120 6272101 JEBEL AULIA 15.20 32.50 380 1920 1988 0.0
1920 0.03 0.03 0.03 0.00 0.03 6.90 20.00 108.80 47.30 1.00 0.01 0.03
1921 0.03 0.03 0.03 0.00 0.03 0.00 88.00 57.00 35.00 18.50 0.01 0.03
1922 0.03 0.03 0.03 0.00 0.03 0.00 87.50 102.30 10.40 15.20 0.01 0.03
(...)
There are ~100 observation stations that I'm interested in, each of which has a varying start and end date for rainfall measurements. They're formatted as above in a single data file, with stations separated by "120 (station number) (station name)".
I need first to separate this file by station, then to extract March, April, May and June for each year, then take a total of these months for each year. So far I'm messing around with loops (as below), but I understand this isn't the right way to go about it and would rather learn some better technique.
Thanks again for the help!
(Original question:)
I've got a large data set containing rainfall by season for ~100 years over 100+ locations. I'm trying to separate this data into more managable arrays, and in particular I want to retrieve the sum of the rainfall for March, April, May and June for each station for each year.
The following is a simplified version of my code so far:
a <- array(1,dim=c(10,12))
for (i in 1:5) {
# all data:
assign(paste("station_",i,sep=""), a)
#march - june data:
assign(paste("station_",i,"_mamj",sep=""), a[,4:7])
}
So this gives me station_(i)_mamj, which contains the data for the months I'm interested in for each station. Now I want to sum each row of this array and enter it in a new array called station_(i)_mamj_tot. Simple enough in theory, but I can't work out how to reference station_(i)_mamj so that i varies with each iteration. Any help much appreciated!
This is totally begging for a data frame; then it's just this one-liner with power tools like ddply (amazingly powerful):
tot_mamj <- ddply(rain[rain$month %in% 3:6,-2], 'year', colwise(sum))
giving your aggregate of total for M/A/M/J, by year:
year station_1 station_2 station_3 station_4 station_5 ...
1 1972 8.618960 5.697739 10.083192 9.264512 11.152378 ...
2 1973 18.571748 18.903280 11.832462 18.262272 10.509621 ...
3 1974 22.415201 22.670821 32.850745 31.634717 20.523778 ...
4 1975 16.773286 17.683704 18.259066 14.996550 19.007762 ...
...
Below is fully working code. We create a data frame whose column names are 'station_n', plus extra columns for year and month (a factor, or else an integer if you're lazy; see the footnote). Now you can do arbitrary analysis by month or year (using plyr's split-apply-combine paradigm):
require(plyr) # for d*ply, summarise
#require(reshape) # for melt
# Parameterize everything here, it's crucial for testing/debugging
all_years <- c(1970:2011)
nYears <- length(all_years)
nStations <- 101
# We want station names as vector of chr (as opposed to simple indices)
station_names <- paste ('station_', 1:nStations, sep='')
rain <- data.frame(cbind(
year=rep(c(1970:2011),12),
month=1:12
))
# Fill in NAs for all data
rain[,station_names] <- as.numeric(NA)
# Make 'month' a factor, to prevent any numerical funny stuff e.g accidentally 'aggregating' it
rain$month <- factor(rain$month)
# For convenience, store the row indices for all years, M/A/M/J
I.mamj <- which(rain$month %in% 3:6)
# Insert made-up seasonal data for M/A/M/J for testing... leave everything else NA intentionally
rain[I.mamj,station_names] <- c(3,5,9,6) * runif(4*nYears*nStations)
# Get our aggregate of MAMJ totals, by year
# The '-2' column index means: "exclude month, to prevent it also getting 'aggregated'"
excludeMonthCol = -2
tot_mamj <- ddply(rain[rain$month %in% 3:6, excludeMonthCol], 'year', colwise(sum))
# voila!!
# year station_1 station_2 station_3 station_4 station_5
# 1 1972 8.618960 5.697739 10.083192 9.264512 11.152378
# 2 1973 18.571748 18.903280 11.832462 18.262272 10.509621
# 3 1974 22.415201 22.670821 32.850745 31.634717 20.523778
# 4 1975 16.773286 17.683704 18.259066 14.996550 19.007762
As a footnote, before I converted month from numeric to factor, it was getting silently 'aggregated' (until I put in the '-2': exclude column reference).
However, better still, once you make it a factor it will refuse point-blank to be aggregated and will throw an error (which is desirable for debugging):
ddply(rain[rain$month %in% 3:6, ], 'year', colwise(sum))
Error in Summary.factor(c(3L, 3L, 3L, 3L, 3L, 3L), na.rm = FALSE) :
sum not meaningful for factors
For your original question, use get():
i <- 10
var <- paste("test", i, sep="_")
assign(var, 10)
get(var)
As David said, this is probably not the best path to be taking, but it can be useful at times (and IMO the assign/get construct is far better than eval(parse))
Why are you using assign to create variables like station_1, station_2, station_3_mamj and so on? It would be much easier and more intuitive to store them in a list, like stations[[1]], stations[[2]], stations_mamj[[3]], and such. Then each could be accessed using its index.
Since it looks like each piece of per-station data you're working with is a matrix of the same size, you could even deal with them as a three-dimensional matrix.
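For example, here is a minimal sketch of that list-based layout, mirroring the simplified loop in the question (the array a is just placeholder data):

stations <- vector("list", 5)
stations_mamj <- vector("list", 5)
for (i in 1:5) {
  a <- array(1, dim = c(10, 12))       # stand-in for one station's data
  stations[[i]] <- a
  stations_mamj[[i]] <- a[, 4:7]       # the same columns the question extracts
}
# row totals for every station at once, no assign()/get() needed
stations_mamj_tot <- lapply(stations_mamj, rowSums)
stations_mamj_tot[[3]]                 # totals for station 3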
ETA: Incidentally, if you really want to solve the problem this way, you would do:
eval(parse(text=paste("station", i, "mamj", sep="_")))
But don't: using eval is almost always bad practice, and it will make it difficult to do even simple operations on your data.
