This might sound bonkers, but looking to see if there are any ideas on how to do this.
I have N categories (say 7) where a set number of people (say 1000) have to be allocated. I know from historical data the minimum and maximum for each category (there is limited historical data, say 15 samples, so I have data that looks like this - if I had a larger sample, I would try to generate a distribution for each category from all the samples, but there isn't.
-Year 1: [78 97 300 358 132 35 0]
-Year 2: [24 74 346 300 148 84 22]
-.
-.
-Year 15:[25 85 382 302 146 52 8]
The min and max for each category over these 15 years of data is:
Min: [25 74 252 278 112 27 0 ]
Max: [132 141 382 360 177 84 22]
I am trying to scale this using simulation - by allocating 1000 to each category within the min and max limits, and repeating it. The only condition is that the sum of the allocation across the seven categories in each simulation has to sum to 1000.
Any ideas would be greatly appreciated!
The distribution you want is called the multinomial distribution. You can use the RandMultinomial function in SAS/IML to produce random samples from the multinomial distribution. To use the multinomial distribution, you need to know the probability of an individual in each category. If this probability has not changed over time, the best estimate of this probability is to take the average proportion in each category.
Thus, I would recommend using ALL the data to estimate the probability, not just max and min:
proc iml;
X = {...}; /* X is a 15 x 7 matrix of counts, each row is a year */
mean = mean(X);
p = mean / sum(mean);
/* simulate new counts by using the multinomial distribution */
numSamples = 10;
SampleSize = 1000;
Y = randmultinomial(numSamples, SampleSize, p);
print Y;
Now, if you insist on using the max/min, you could use the midrange to estimate the most likely value and use that to estimate the probabilty, as follows:
Min = {25 74 252 278 112 27 0};
Max = {132 141 382 360 177 84 22};
/* use midrange to estimate probabilities */
midrange = (Min + Max)/2;
p = midrange / sum(midrange);
/* now use RandMultinomial, as before */
If you use the second method, there is no guarantee that the simulated values will not exceed the Min/Max values, although in practice many of the samples will obey that criterion.
Personally, I advocate the first method, which uses the average count. Or you can use a time-weighted count, if you think recent observations are more relevant than observations from 15 years ago.
Related
I was following this question to address a similar situation:
How to Calculate Area Under the Curve in Spotfire?
My data is in the following format:
PLANT
OBS_DATE_RECORDED
TRAIT_VALUE
period
A
3/16/2021
225
A3/16/2021
A
3/23/2021
227
A3/23/2021
A
3/30/2021
220
A3/30/2021
A
4/7/2021
240
A4/7/2021
A
4/13/2021
197
A4/13/2021
A
4/20/2021
197
A4/20/2021
A
4/27/2021
218
A4/27/2021
B
3/16/2021
253
B3/16/2021
B
3/23/2021
274
B3/23/2021
B
3/30/2021
271
B3/30/2021
B
4/7/2021
257
B4/7/2021
B
4/13/2021
250
B4/13/2021
A
4/20/2021
241
A4/20/2021
B
4/27/2021
255
B4/27/2021
Following the answer's formula as a calculated column:
([TRAIT_VALUE] + Avg([TRAIT_VALUE]) over (Intersect(NextPeriod([period]),[PLANT]))) / 2 * (Avg([OBS_DATE_RECORDED]) over (Intersect(NextPeriod([period]),[PLANT])) - [OBS_DATE_RECORDED])
However, the results don't appear correct.
AUDPC
1603.19:59:59.928
1608.17:59:59.956
2924.20:0:0.100
7732.21:0:0.000
1395.14:41:44.404
1461.23:30:0.050
-4393.7:59:59.712
I think the problem might be the date format but don't understand the formula well enough to troubleshoot. In Excel I usually compute the AUDPC by using the SUMPRODUCTS multiplying the days between two dates by the average TRAIT_VALUE between those two dates.
I am trying to apply normalization to my data and I have tried the Conventional scaling techniques using sklearn packages readily available for this kind of requirement. However, I am looking to implement something called Decimal scaling.
I read about it in this research paper and looks like a technique which can improve results of a neural network regression. As per my understanding, this is what I believe needs to be done -
Suppose the range of attribute X is −4856 to 28. The maximum absolute value of X is 4856.
To normalize by decimal scaling I will need to divide each value by 10000 (c = 4). In this case, −4856 becomes −0.4856 while 28 becomes 0.0028.
So for all values: new value = old value/ 10^c
How can I reproduce this as a function in Python so as to normalize all the features(column by column) in my data set?
Input:
A B C
30 90 75
56 168 140
28 84 70
369 1107 922.5
485 1455 1212.5
4856 14568 12140
40 120 100
56 168 140
45 135 112.5
78 234 195
899 2697 2247.5
Output:
A B C
0.003 0.0009 0.0075
0.0056 0.00168 0.014
0.0028 0.00084 0.007
0.0369 0.01107 0.09225
0.0485 0.01455 0.12125
0.4856 0.14568 1.214
0.004 0.0012 0.01
0.0056 0.00168 0.014
0.0045 0.00135 0.01125
0.0078 0.00234 0.0195
0.0899 0.02697 0.22475
Thank you guys for asking questions which led me to think about my problem more clearly and break it into steps. I have arrived to a solution. Here's how my solution looks like:
def Dec_scale(df):
for x in df:
p = df[x].max()
q = len(str(abs(p)))
df[x] = df[x]/10**q
I hope this solution looks agreeable!
def decimal_scaling (df):
df_abs = abs(df)
max_valus= df_abs.max()
log_num=[]
for i in range(max_valus.shape[0]):
log_num.append(int(math.log10(max_valus[i]))+1)
log_num = np.array(log_num)
log_num = [pow(10, number) for number in log_num]
X_full =df/log_num
return X_full
Using integers ONLY (no floating-point), is there a way to determine between two fractions, which result is greater?
for example say we have these two fraction:
1000/51 = 19(.60) && 1000/52 = 19(.23)
If we were to use floating point numbers obviously the first fraction is greater; however, both fractions equal 19 if we were to use integers only. How might one find out which is greater with out using floating point math?
I have tried to get the remainder using the % operator but does not seem to work in all cases.
1/2 can be think one apple give two people, so every people take 0.5 apple.
so 1000/51 consider as 1000 apples give 51 people.
1000/51 > 1000/52, because the apple the same,but we wanna give it to more people.
it is simple example, more complex exmaple:
1213/109 1245/115 which is greater?
1245 is greater than 1213 and 115 is greater than 109, difference:
1245 - 1213 = 32, and 115 - 109 = 6, 32/6 replce 1245/109, compare 1213/109 to 32/6.
32/6 ≈ 5 and less 6, 6*109 = 654 < 1213, so 1213/109 > 1245/115.
1213/109 1245/115
1213/109 32/6 # make diff 1245 - 1213 = 32 115 - 109 = 6
# compare diff to 1213/109
1213 > 109 * 6
# then
1213/109 > 1245/115
170! approaches the limit of a floating point double: 171! will overflow.
However 170! is over 300 digits long.
There is, therefore, no way that 170! can be represented precisely in floating point.
Yet Excel returns the correct answer for 170! / 169!.
Why is this? I'd expect some error to creep in, but it returns an integral value. Does Excel somehow know how to optimise this calculation?
If you find the closest doubles to 170! and 169!, they are
double oneseventy = 5818033100654137.0 * 256;
double onesixtynine = 8761273375102700.0;
times the same power of two. The closest double to the quotient of these is exactly 170.0.
Also, Excel may compute 170! by multiplying 169! by 170.
William Kahan has a paper called "How Futile are Mindless Assessments of Roundoff in Floating-Point Computation?" where he discusses some of the insanity that goes on in Excel. It may be that Excel is not computing 170 exactly, but rather it's hiding an ulp of reality from you.
The answer of tmyklebu is already perfect. But I wanted to know more.
What if implementation of n! was something trivial as return double(n)*(n-1)!...
Here is a Smalltalk snippet, but you can translate in many other languages, that's not the point:
(2 to: 170) count: [:n |
| num den |
den := (2 to: n - 1) inject: 1.0 into: [:p :e | p*e].
num := n*den.
num / den ~= n].
And the answer is 12
So you have not been particulary lucky, due to good properties of round to nearest even rounding mode, out of these 169 numbers, only 12 don't behave as expected.
Which ones? Replace count: by select: and you get:
#(24 47 59 61 81 96 101 104 105 114 122 146)
If I had an Excel handy, I would ask to evaluate 146!/145!.
Curiously (only apparently curiously), a less naive solution that computes the exact factorial with large integer arithmetic, then convert to nearest float, does not perform better !
(2 to: 170) reject: [:n |
n factorial asFloat / (n-1) factorial asFloat = n]
leads to:
#(24 31 34 40 41 45 46 57 61 70 75 78 79 86 88 92 93 111 115 116 117 119 122 124 141 144 147 164)
I have a sample data range that has four categories,
foo | bar | bizz| buzz
---------------------------
163 345 456 2435
232 234 457 2435
123 346 234 3673
Foo is the dependant variable, bar, bizz and buzz are independant variables. I've went to Data Analysis => Regression => picked those columns as appropriate, gotten all of the regression statistics and some plots that represent it. How do I find the formula that it used so that I can use it in my predictions in an application?
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.462484844
R Square 0.213892231
Adjusted R Square 0.212161986
Standard Error 2991.441979
Observations 1367
ANOVA
df SS MS F Significance F
Regression 3 3318714896 1106238299 123.6196536 8.06738E-71
Residual 1363 12197112332 8948725.116
Total 1366 15515827228
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 703.0478619 126.1475776 5.5732173 3.01028E-08 455.5834102 950.5123135 455.5834102 950.5123135
Bar 41.53512531 2.493716675 16.65591193 7.6937E-57 36.64318651 46.42706411 36.64318651 46.42706411
Bizz 1.96479128 0.361015402 5.442402932 6.22595E-08 1.256585224 2.672997336 1.256585224 2.672997336
Buzz 16.77200247 5.419776635 3.094592933 0.002010941 6.139994479 27.40401046 6.139994479 27.40401046
RESIDUAL OUTPUT PROBABILITY OUTPUT
Observation Predicted foo Residuals Standard Residuals Percentile foo
1 6780.632281 34894.36772 11.67756172 0.036576445 63
2 6722.069851 28513.93015 9.542318743 0.109729334 63
3 3382.925842 21471.07416 7.185394378 0.182882224 63
Oh hey, my stats class looks 98% less useless now.
According to that output,
foo = 703.0478619 + 41.53512531 * bar + 1.96479128 * bizz + 16.77200247 * buzz
You can see these values where it lists the coefficients/standard errors for Intercept, Bar, Bizz, and Buzz.
Should probably note that the r squared value is extremely low, which (if I recall correctly) means that the variance in foo is not well explained by the independent variables.