Random Numbers with assigned Probability in Excel


Following is what I am trying to do:
I have a set of items, let's say A to J, 10 items in total. I want to generate 20 draws, and in each draw I need 3 of those 10 items. If the first item comes out as A, it should not show up as the second or third item, irrespective of its assigned probability.
Let's say:
A - 4%
B - 20%
C - 1%
D - 16%
E - 5%
F - 7%
G - 3%
H - 21%
I - 6%
J - 17%
Now, I need to randomly generate 3 items from the above list in each draw, according to their assigned probabilities, but if the first item is B, say, the second and third items should not be B. I should repeat the same process for 20 draws.
The answer should look something like this:
            1st Item   2nd Item   3rd Item
1st Draw    B          D          J
2nd Draw    D          E          F
3rd Draw    B          H          G
The numbers should be generated according to their assigned probabilities.
Thanks in advance.

For a formula route:
You will need to build two helper columns. The first is a running total.
I put your values in G1:H10.
Then in I1 I put 1.
In I2 I put:
=I1+(H1*100)
and copied it down.
I then created the second helper. In K1 I put:
=INDEX(G:G,MATCH(ROW(1:1),I:I))
and copied it down 100 rows.
This builds a 100-row lookup column in which each item appears once per percentage point.
Then in B2 I put:
=INDEX($K:$K,AGGREGATE(15,6,ROW($1:$100)/(COUNTIF($A2:A2,$K$1:$K$100)=0),RANDBETWEEN(1,100-SUMPRODUCT(COUNTIF($A2:A2,$K$1:$K$100)))))
I copied this across three columns and down as many rows as wanted.
Caveats:
This works only with whole percentages, no 20.513%.
Every choice in the table must be at least 1%.
The total percentage must equal 100%
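The same idea can be sketched outside Excel. The Python below is only an illustration of the helper-column technique, not a translation of the exact formula: TABLE plays the role of column K (one slot per percentage point), and each pick chooses uniformly among the slots whose item has not yet appeared in the current draw. The names WEIGHTS, TABLE and draw are made up for this example.

```python
import random

# Percentages from the question; they must be whole numbers that sum to 100.
WEIGHTS = {"A": 4, "B": 20, "C": 1, "D": 16, "E": 5,
           "F": 7, "G": 3, "H": 21, "I": 6, "J": 17}

# Equivalent of helper column K: each item repeated once per percentage point.
TABLE = [item for item, pct in WEIGHTS.items() for _ in range(pct)]


def draw(count=3, rng=random):
    """Pick `count` distinct items, ignoring slots of items already drawn."""
    picked = []
    for _ in range(count):
        remaining = [item for item in TABLE if item not in picked]
        picked.append(rng.choice(remaining))
    return picked


# 20 draws of 3 distinct items each.
draws = [draw() for _ in range(20)]
```

Because drawn items are removed from the pool before the next pick, the remaining items keep their relative weights, which is what the RANDBETWEEN upper bound in the formula achieves.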

Here is another way, based on post #7 of this thread (https://www.mrexcel.com/forum/excel-questions/372071-random-numbers-assigned-probabilities.html). It uses cumulative values and a similar lookup.
Add a helper column C and use this formula: =SUM($B$2:B2), and drag down.
On cell F2, you can enter this formula:
=INDEX($A$2:$A$11,COUNTIF($C$2:$C$11,"<="&RAND())+1)
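COUNTIF counts how many cumulative values are less than or equal to RAND(), i.e. how many items lie entirely below the random number; adding 1 then moves to the next item, the one whose band contains it. Give it a try and let me know.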
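For readers who want to check the lookup logic, here is a minimal Python sketch of the cumulative-value approach. It assumes the items and percentages from the question; ITEMS, PERCENT, CUMULATIVE and pick are names invented for the example. Note that it makes a single pick and does not by itself prevent repeats within a draw.

```python
from bisect import bisect_right
from itertools import accumulate

ITEMS = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
PERCENT = [4, 20, 1, 16, 5, 7, 3, 21, 6, 17]

# Helper column C: running total of the percentages (4, 24, 25, ..., 100).
CUMULATIVE = list(accumulate(PERCENT))


def pick(r):
    """Return the item for a uniform random number r in [0, 1)."""
    # Same count as COUNTIF(cumulative, "<=" & r): running totals that are <= r.
    position = bisect_right(CUMULATIVE, r * 100)
    # Python lists are 0-based, so the "+1" of the Excel formula is not needed.
    return ITEMS[position]
```

For example, pick(0.10) falls in the band between 4% and 24%, so it returns "B".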

Here's something that combines the approaches in the two earlier answers. Like ian0411's answer it utilises the cumulative probability distribution. It also utilises Scott Craner's technique of constructing an array of 0's and 1's which indicate matches between the possible outcomes (A,...,J) and those which have already been drawn.
The formula in cell P2 is
=INDEX($A$2:$A$11,1+IFERROR(MATCH(RAND(),MMULT($D$2:$M$11,($B$2:$B$11)*(1-COUNTIF($O2:O2,$A$2:$A$11))/SUMPRODUCT($B$2:$B$11,(1-COUNTIF($O2:O2,$A$2:$A$11))))),0))
and this is copied into cells Q2 and R2 and then dragged down for each draw.
The probability distribution in $B$2:$B$11 has elements replaced by zeroes for any prior drawn items in the current draw (hat-tip to Scott for how this was achieved). The adjusted distribution is converted to a cumulative format via the kludgy matrix multiplication operation (couldn't think of a more elegant approach) and normalised to a "proper" cumulative distribution by dividing by the sum of its elements. Rather than using =1+COUNTIF(cumdist,"<="&RAND()) (where cumdist is the cumulative distribution) to pick out the element of cumdist matching to the random variable, I have used an alternative of =1+IFERROR(MATCH(RAND(),cumdist),0).
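The steps described above (zero out drawn items, renormalise, build a cumulative distribution, look up a random number) can be sketched in Python as follows. This is an illustration under the question's probabilities, not a line-by-line port of the formula; ITEMS, PROBS and draw are hypothetical names.

```python
import random
from itertools import accumulate

ITEMS = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
PROBS = [0.04, 0.20, 0.01, 0.16, 0.05, 0.07, 0.03, 0.21, 0.06, 0.17]


def draw(count=3, rng=random):
    """Draw `count` distinct items, renormalising after each pick."""
    picked = []
    for _ in range(count):
        # Zero the probability of anything already drawn in this draw.
        masked = [0.0 if item in picked else p
                  for item, p in zip(ITEMS, PROBS)]
        total = sum(masked)
        # Normalised cumulative distribution of the remaining items.
        cumulative = [c / total for c in accumulate(masked)]
        r = rng.random()
        # First item whose cumulative value exceeds r.
        index = next(i for i, c in enumerate(cumulative) if c > r)
        picked.append(ITEMS[index])
    return picked
```

A masked item adds nothing to the cumulative value, so it can never be the first entry to exceed r, which is why repeats cannot occur.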

Related

Calculating a running sum with reset in Excel

I have a need to calculate a running sum of column D's count data in column E. However, I only want to calculate the running sum for the appropriate categories in columns B and C. In other words, there are four combinations of categories and I need a running sum for each. The easiest way is to do what I currently have in column F (cell F3=SUM($D$2:D3)) and drag it down through F11 and manually restart it at F12. I can't do this in my full dataset though because there are about 20k rows of data. So, I'm trying to make column E dynamically calculate what's in column F. I started with =SUMIFS() and can return the final sum for each combination of the two categories, but it's not a running sum, created dynamically, that resets with the new day count in column A.
Any suggestions would be appreciated. TIA

Initial Response
If I understand the problem correctly, the solution is actually very simple.
This formula goes to E2 (and then copy down):
=IF(B1&C1<>B2&C2,D2,SUM(E1,D2))
In each case (including the first) of a change of either Cat A or Cat B, it takes the count value at that row (i.e. the new starting balance). Thereafter, it does a running balance addition (balance from row above + count at this row), until the next change of Cat A or Cat B is encountered.
Catering for a wider range of Categories:
The above assumes: Cat A and Cat B are only ever 0 or 1 (per the example in OP).
If this isn't true (and the Cats could be any range of values), change the formula per chris neilsen's suggestion (per comments):
=IF(OR(B1<>B2,C1<>C2),D2,SUM(E1,D2))
Catering for Previous Same-State Categories:
Both the original formula and the alternate above assume:
Should Cat A & Cat B states change, but subsequently return a 'previous same-state', the running total should still start afresh (i.e. don't 'carry forward' from 'previous same-states').
If one wanted the running balance to include the balance of previous same-states for Cat A & Cat B, use solution suggested by Apostolos55 (below)
=SUMIFS($D$1:D2,$B$1:B2,B2,$C$1:C2,C2)
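To make the difference between the two behaviours concrete, here is a small Python sketch. The first function mirrors the IF-based formula (reset whenever either category changes), the second mirrors the SUMIFS formula (carry the balance forward when a category pair recurs). The function names and sample rows are invented for the example.

```python
def running_sum_with_reset(rows):
    """rows: list of (cat_a, cat_b, count). Restart when either category changes."""
    totals = []
    previous_key = None
    running = 0
    for cat_a, cat_b, count in rows:
        key = (cat_a, cat_b)
        running = count if key != previous_key else running + count
        totals.append(running)
        previous_key = key
    return totals


def running_sum_per_category(rows):
    """SUMIFS-style: one balance per category pair, carried across gaps."""
    balances = {}
    totals = []
    for cat_a, cat_b, count in rows:
        key = (cat_a, cat_b)
        balances[key] = balances.get(key, 0) + count
        totals.append(balances[key])
    return totals


# Hypothetical data: the pair (0, 0) comes back in the last row.
rows = [(0, 0, 5), (0, 0, 3), (0, 1, 2), (0, 1, 4), (0, 0, 1)]
print(running_sum_with_reset(rows))    # [5, 8, 2, 6, 1]
print(running_sum_per_category(rows))  # [5, 8, 2, 6, 9]
```

The two only differ in the last row, where the earlier (0, 0) balance of 8 is either discarded or carried forward.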

All you need is a minor adjustment to the cumulative count:
=SUMIFS($D$2:D2,$B$2:B2,B2,$C$2:C2,C2)
Put it in Current Count and it is done. Drag or autofill down as needed.
Because each range ends at the current row, it only takes into account the current row and the rows above it.

It's a bit hacky, but you can add a few columns:
consecutive numbers from 2 to end of your data (say this is in Column G).
references to every 10th cell (e.g. G2, G12) for every cell in that cluster. (say this is in Column H)
you can fill in the first few rows for these two columns and then drag it down.
a reference to the count column ('D') in a single cell (say I2).
You can then use CONCATENATE and INDIRECT inside SUM:
Cumulative Count:
=SUM(INDIRECT(CONCATENATE($I$2,H2)):D2)
and drag this down.

Excel Calculate a running average on filtered data with criteria

I am trying to calculate a running average on a filtered data table. All other posts online use Sumproduct(Subtotal) on the entire range but do not calculate a row by row running average
I am stuck on how to calculate columns C and D.
If column B (Score) > 0, I want to sum and average it under column C (Average Win)
If column B (Score) < 0, I want to sum and average it under column D (Average Loss)
The table is filterable by column A (Type) and the results should look as follows
Progress so far:
I have figured out how to calculate a Cumulative score based on filtered data. However this does not fully solve my problem. I appreciate any help!
=SUBTOTAL(3,B3)*SUBTOTAL(9,B$3:B3)
SUBTOTAL(3,B3) checks if the current row is visible, SUBTOTAL(9,B$3:B3) sums the values.
Final update needed
Jos - Thank you for your detailed explanation on how subtotal() works. I learned a ton through your explanation and will continue to study it. This is my first time being exposed to structured referencing so some of the syntax is a bit confusing to me still
The last formula I need is a running win % column where a Win is defined by score > 0. Please see the picture below
I assume the same formula would work, except that we average a 1 or 0 in each row instead of the [Score] column.
Using the prior solution, why can't we modify the output of your prior solution to calculate a running win %?
[...] IF([Score]>0,IF(ROW([Score])<=ROW([@Score]),[Win])))),0)
Where [Win] is a helper column with the outputs 1 for win, 0 for loss.
This could be done by saying
=IF([@Score]>0,1,0)
Instead of averaging the actual Score values, this would average a column of 1's and 0's, giving the desired output 0%, 50%, 66%, etc.
I am aware that the solution I provided does not work but I am trying to embrace the correct logic. I still struggle to understand how these structured column references are calculated on a row by row basis.
For example: AVERAGE(IF([Score]>0,[Score]))
How is this calculated on a row by row basis? When A3 does If([Score] > 0,), does this equal If({-10}>0)? When on A4, does If([Score]>0) equal If({-10,20} >0)? Thank you for your patience and help thus far.

I disagree with your result for Average Loss for the last row of your unfiltered table (surely -9.33...?), but try this for Average Win:
=IFERROR(AVERAGE(IF(SUBTOTAL(3,OFFSET(INDEX([Score],1),ROW([Score])-MIN(ROW([Score])),)),IF([Score]>0,IF(ROW([Score])<=ROW([@Score]),[Score])))),0)
Same formula for Average Loss, changing [Score]>0 to [Score]<0.
Explanation:
Using the data you provided and assuming:
The table's top-left cell is in A1
The table is filtered on the Type column for "A"
In order to determine which rows are filtered, we must pass an array of range references - i.e. for each cell within a chosen column of the table - to the SUBTOTAL function. It's a touch unfortunate that such an array of range references can only be generated via a volatile function (INDIRECT or OFFSET), but here, unless we resort to helper columns, we are left with no choice.
INDEX([Score],1)
simply returns a range reference to the first cell within the Score column. When using Excel tables, it's preferable not to write formulas which include a mixture of structured and non-structured referencing, even if that results in slightly longer expressions. So here, for example, we would not reference A2 within the formula.
ROW([Score])-MIN(ROW([Score]))
generates an array of integers from 0 up to one fewer than the number of rows in the table, i.e.
{0;1;2;3;4}
and so
=IFERROR(AVERAGE(IF(SUBTOTAL(3,OFFSET(INDEX([Score],1),ROW([Score])-MIN(ROW([Score])),)),IF([Score]>0,IF(ROW([Score])<=ROW([@Score]),[Score])))),0)
becomes
=IFERROR(AVERAGE(IF(SUBTOTAL(3,OFFSET(A2,{0;1;2;3;4},)),IF([Score]>0,IF(ROW([Score])<=ROW([@Score]),[Score])))),0)
OFFSET then generates an array of range references (though note that you will not be able to 'see' this step within the Evaluate Formula window - rather, an array of #VALUE! errors is displayed):
=IFERROR(AVERAGE(IF(SUBTOTAL(3,{A2;A3;A4;A5;A6}),IF([Score]>0,IF(ROW([Score])<=ROW([@Score]),[Score])))),0)
SUBTOTAL then determines which of these range references is still visible after filtering (note that care must be given here to the choice of first parameter), returning 1 for a visible cell and 0 for a hidden one, so that:
SUBTOTAL(3,{A2;A3;A4;A5;A6})
resolves to:
{1;1;1;0;1}
And so we now have:
=IFERROR(AVERAGE(IF({1;1;1;0;1},IF([Score]>0,IF(ROW([Score])<=ROW([@Score]),[Score])))),0)
and the rest is straightforward.
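The remaining steps amount to: for each row, keep only the scores that are visible, positive and at or above the current row, then average them (or return 0 if there are none). A rough Python sketch of that logic, using made-up scores and the visibility array {1;1;1;0;1} from the explanation, might look like this. The function name and the data are hypothetical.

```python
def running_average_win(scores, visible):
    """For each row, average the positive scores in visible rows up to that row."""
    result = []
    for current in range(len(scores)):
        wins = [s for s, shown in zip(scores[:current + 1], visible[:current + 1])
                if shown and s > 0]
        # IFERROR(..., 0): no qualifying rows yet means 0 rather than an error.
        result.append(sum(wins) / len(wins) if wins else 0)
    return result


scores = [-10, 20, 30, -5, 10]   # hypothetical Score column
visible = [1, 1, 1, 0, 1]        # 1 = row survives the filter
print(running_average_win(scores, visible))  # [0, 20.0, 25.0, 25.0, 20.0]
```

Changing `s > 0` to `s < 0` gives the Average Loss version, matching the change from [Score]>0 to [Score]<0 in the formula.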

So, I would use AVERAGEIFS().
=AVERAGEIFS(B:B,B:B,">0",A:A,"A")
is one example; note I have added the control of Type A in the example.

lowest value in top 50%

I'm working on a King of the Hill/Elimination bracket spreadsheet & I have a cell that I want to return the cut (last score in the top 50%). Would anyone know how to go about this?
I have a formula for the average of the range, excluding 0's, but that isn't accurate since it isn't actually showing the lowest score: =AVERAGEIF(F9:G667,"<>0")

What you're describing is "percentile". Excel's percentile function interpolates, so I'm not sure it's appropriate for your use case.
See https://support.office.com/en-us/article/percentile-function-91b43a53-543c-4708-93de-d626debdddca
At the very least, you can compute the percentile, then take the minimum score of all values filtered to be above the interpolated 50th percentile.
Here's a slightly clever implementation:
Assume your data is in the range C2:C11.
In C13, we'll compute the 50th percentile as =PERCENTILE(C2:C11, 0.5)
In column D, we'll use an IF statement to either select the adjacent value from column C, or a very large number, depending on whether the value is greater than the percentile. E.g., =IF(C2 > $C$13,C2,400000)
Now we can take the min of column D: =MIN(D2:D11)
The only clever bit is using a giant number when the value in column C is not greater than the percentile, so that it effectively becomes invisible to the min operation.
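As a sanity check of the idea, here is a minimal Python sketch. It relies on the fact that the interpolated 50th percentile is the median, so statistics.median stands in for PERCENTILE(range, 0.5); the function name is made up. Instead of substituting a giant number, it simply filters out the values that are not above the median.

```python
from statistics import median


def cut_above_median(scores):
    """Smallest score that is strictly greater than the median."""
    midpoint = median(scores)
    above = [s for s in scores if s > midpoint]
    # Raises ValueError if no score is above the median (e.g. all scores equal).
    return min(above)


print(cut_above_median(list(range(1, 11))))  # median is 5.5, so this prints 6
```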

The last score in the top 50% is the kth largest item where k is the number of items divided by 2. Excel has a useful function called LARGE, which returns the kth largest item. It also has an even more useful function called AGGREGATE which lets you do things like SUM, COUNT, AVERAGE, LARGE, SMALL, etc - but skips hidden rows or Error Values.
So, to get the kth largest item of the list (ignoring Error Values) we would use =AGGREGATE(14, 6, F9:G667, k), where 14 selects LARGE and 6 ignores errors - of course, this is not going to skip 0, because 0 is not an error. But, you know what is? Divide by 0. So, if we do F9*F9/F9, then we will either get F9 (for F9<>0) or the #DIV/0! error.
This now means our function is =AGGREGATE(14, 6, F9:G667*F9:G667/F9:G667, k), but we still need to decide on a value for k. Well, if we have 3 items then we want the 2nd one, if we have 6 items then we want the 3rd one - so, for n items we want item n÷2, rounded up. Well, that's what the ROUNDUP function is for!
Still, we need to know what n is - but that's simple. It's just the number of non-0 items in the list, or COUNTIF(F9:G667,"<>0") (or ">0" if your list cannot contain negative numbers).
Plug it all together, and we get this:
=AGGREGATE(14, 6, F9:G667*F9:G667/F9:G667, ROUNDUP(COUNTIF(F9:G667,"<>0")/2, 0))
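The rule itself (ignore zeros, take the kth largest with k = n÷2 rounded up) is easy to verify with a few lines of Python. This is only a sketch of the logic; top_half_cut is a name chosen for the example.

```python
import math


def top_half_cut(scores):
    """Last score in the top 50% of the non-zero scores."""
    nonzero = sorted((s for s in scores if s != 0), reverse=True)
    k = math.ceil(len(nonzero) / 2)   # n / 2, rounded up
    return nonzero[k - 1]             # kth largest (lists are 0-based)


print(top_half_cut([10, 0, 30, 20]))          # 3 non-zero items, 2nd largest: 20
print(top_half_cut([6, 5, 4, 3, 2, 1, 0]))    # 6 non-zero items, 3rd largest: 4
```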

Calculate the average of the 3 highest grades

I'm relatively new to excel but I am making a gradebook that calculates all of my grades.
One of my classes has an interesting way to calculate grades. There are 4 quizzes, and the lowest one will be dropped (essentially removed from the calculation completely).
How would I go about this, I tried using
=((SUM(D2:D5)-SUM(SMALL(D2:D5,4)))/(COUNT(D2:D5)*100))
on data like this
D2|75
D3|80
D4|83
D5|65
So in this case, I want the 65 to be removed, then calculate the average
I am not getting any error but the average is wrong

Subtract 1 from the count, and subtract the lowest grade with MIN rather than SMALL(D2:D5,4), which returns the largest value. In the example you are dividing by 4 instead of 3.
Like this:
=(SUM(D2:D5)-MIN(D2:D5))/((COUNT(D2:D5)-1)*100)

You can average the top 3 out of 4 with this formula
=AVERAGE(LARGE(D2:D5,{1,2,3}))
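Both routes (sum minus the minimum, or averaging the three largest) give the same number, which a short Python check with the question's data confirms. The function names are invented for this sketch.

```python
def average_dropping_lowest(grades):
    """(SUM - MIN) / (COUNT - 1)."""
    return (sum(grades) - min(grades)) / (len(grades) - 1)


def average_of_top(grades, n=3):
    """AVERAGE(LARGE(range, {1,2,3})) for n = 3."""
    top = sorted(grades, reverse=True)[:n]
    return sum(top) / len(top)


grades = [75, 80, 83, 65]
print(average_dropping_lowest(grades))  # 238 / 3, roughly 79.33
print(average_of_top(grades))           # same value
```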

Add the four tests together, subtract out the min(.) of the same range, and divide the total by 3.
If your scores are in B2, C2, D2, and E2 then something like:
=(SUM(B2:E2)-MIN(B2:E2))/3
Was this helpful?

Can I use a built-in Excel solver to solve this equation somehow? If not, how would you go about it?

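First of all, let me show you guys the equation in question (written here in the form the answers below use):
CFL = (S/V) * sqrt(D/k) * (erf(sqrt(k*t)) + sqrt(k*t/pi) * exp(-k*t))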
In this equation S, V, and t are known constants. CFL is also known. We have an initial value for D, and we have no idea what k is.
What I need to do is find ideal values for both D and k that would minimize the residuals squared of a calculated CFL and a measured CFL. Using residuals squared is just a way for me to check if they're the best possible values, but it's fine if there's another way to go about this that uses some other method.
The residual squared is just the absolute value of the difference between the calculated and measured CFLs, which is then squared. The lower the residual squared, the better the fit we have. So I need the smallest possible residual squared resulting from putting both k and D into the equation. That'll result in a calculated CFL, which I can then compare to a measured CFL, allowing me to calculate the residual squared.
My first idea for how to do this, since I'm not sure how to use Excel equations, was to fix the value of D (since we have an initial starting value to work from) and then vary through different values of k, putting them into the equation to find a calculated CFL, and comparing that to the measured to find the residuals squared, until I find one that results with the smallest residuals squared. Then I fix k at that ideal value, and vary D until I find the smallest residual there as well. Then I fix D again, and go back to varying k. My idea was that I could keep bouncing back and forth like that until both D and k were within a certain percentage of their previous values. I assumed it would reach some sort of equilibrium with this method
However, the numbers just go crazy, and end up either going to zero or going to infinity. So I need to rework my process. Which is where you guys come in!
How would you go about finding the most ideal values for both D and k, which would result in a calculated CFL closest to the measured one, assuming you are given values for every variable above apart from k? Remember to factor in that the value of D given initially is simply a starting place to work from, and is not the most ideal value.
I've been working on this program for a long time (at least a month), and I'm just stuck as hell and desperate. I was hoping you guys could help me out.
Here are some initial values to work with:
S = 19.634954
V = 12.271846
D (initial) = 0.01016482
CFL (measured) = 0.401
t = 4
k = ?
Thank you for any ideas you might have.

As Dean said, your system has two unknowns, and in the general case an infinite number of solutions (different pairs of (D,k)). By fixing D, CFL is a continuous function of k, and as such, you should be able to find a k that gives the CFL you measured (within some accuracy). For this problem (i.e., finding k given CFL) you can use the Goal Seek tool. Here is how:
1) Problem setup:
Use the names of the variables to name the cells in which you input their values (go to Formulas --> Defined Names --> Define Name and give each cell the name of its variable). Then input the values of your parameters in these cells (give k an arbitrary value, e.g. 1), and input the formula in cell CFL like:
=(S/V)*SQRT(D/k)*(ERF(SQRT(k*t))+SQRT(k*t/PI())*EXP(-k*t))
Again, note that S,V,D,k and t are defined as named ranges.
2) Problem Solution:
Go To Data --> Data Tools --> What-If Analysis --> Goal Seek and enter the following parameters:
Set Cell: CFL
To value: 0.401
By changing cell: k
This gave me k=0.151759378, which results in CFL = 0.401261265054823.
I hope this helps?
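For anyone who wants to reproduce the one-unknown solve outside Excel, here is a minimal Python sketch. It is not what Goal Seek does internally; it swaps in plain bisection, which works here because, with D fixed, CFL is above the measured value at the low end of the bracket and below it at the high end. The bracket [1e-6, 10], the tolerance and the function names are assumptions made for the example.

```python
import math

# Values from the question.
S, V, D, T = 19.634954, 12.271846, 0.01016482, 4.0
CFL_MEASURED = 0.401


def cfl(k, d=D):
    """Same expression as the worksheet formula."""
    root_kt = math.sqrt(k * T)
    return (S / V) * math.sqrt(d / k) * (
        math.erf(root_kt) + math.sqrt(k * T / math.pi) * math.exp(-k * T))


def solve_k(target=CFL_MEASURED, low=1e-6, high=10.0, tolerance=1e-12):
    """Bisection on k; assumes cfl(low) > target > cfl(high)."""
    for _ in range(200):
        mid = (low + high) / 2
        if cfl(mid) > target:
            low = mid      # CFL still too large, so k must grow
        else:
            high = mid
        if high - low < tolerance:
            break
    return (low + high) / 2


k = solve_k()
print(k, cfl(k))
```

The k found this way lands close to the Goal Seek result, slightly higher because bisection is run to a tighter tolerance.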
Edit: Finding some solution pairs using VBA:
1) Place the measured CFL value in a cell (I chose H2).
2) Replace named ranges k, D and CFL. I used rngK, rngD and rngCFL, each one starting from row 2 till row 20.
3) Fill down rngD with a step (I took 0.01) using the formula =INDEX(rngD,ROW()-ROW($C$2))+0.01. The first entry of rngD is in cell C2 and has the value 0.01016482. The formula is copied down to all other cells in the range.
4) Fill down rngK with some initial values (I took =1).
5) Fill down the rngCFL range with the formula =(S/V)*SQRT(INDEX(rngD,ROW()-ROW($G$1))/INDEX(rngK,ROW()-ROW($G$1)))*(ERF(SQRT(INDEX(rngK,ROW()-ROW($G$1))*t))+SQRT(INDEX(rngK,ROW()-ROW($G$1))*t/PI())*EXP(-INDEX(rngK,ROW()-ROW($G$1))*t)). I use the ROW() and INDEX() functions to refer to the Range element I need.
6) Finally, use this code in a sub:
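Sub FindPairs()   ' FindPairs is just an illustrative name
    Dim iCnt As Long
    ' Run Goal Seek once per row: change k in that row until its CFL matches H2.
    For iCnt = 1 To Range("rngK").Count
        Range("rngCFL")(iCnt).GoalSeek Goal:=Range("H2").Value, ChangingCell:=Range("rngK")(iCnt)
    Next iCnt
End Sub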
The above generates 19 pairs (D,k) that give the measured CFL value.

You can't solve for two unknown variables in a 1 formula system. However if I take D as given then you have a 1 unknown/1 formula system.
I simply used one column as a guess of k (for me column B). I used another column to represent the calculated CFL with the guessed k (for me column C). I have another column that has either a 1 or -1 (for me column D). Lastly I have a column that represents the absolute value by which I want to increment my guess (for me column E).
I named cells with the given values of the variables to make it easier to use them.
I started with a guess of k=0.1.
Here are the formulas in my first row, which was row 7:
B7: 0.1
C7: =(s/v)*(d/B7)^0.5*(ERF(((B7*t)^0.5))+((B7*t)/PI())^0.5*EXP(-1*B7*t))
Nothing in D7 or E7.
In row 8:
B8: =B7+E8*D8
C8: =(s/v)*(d/B8)^0.5*(ERF(((B8*t)^0.5))+((B8*t)/PI())^0.5*EXP(-1*B8*t))
D8: 1
E8: 0.01
In row 9 the B and C columns are just copied down, but D and E are as follows:
D9: =IF(C8>cfl,1,-1)
E9: =IF(D9=D8,E8,E8/10)
Once you get those in you can just copy down however many rows you want.
What this does is every time the residual of the CFL switches signs the increment's sign will also flip. Additionally, the absolute value of the increment will also shrink by a factor of 10 to give more precision as it goes.
This is by no means the best way to solve your problem but it is a way.
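The row-by-row search described above can also be written as a short loop. The Python below is a sketch of that idea with the question's values: step k in the current direction, look at the sign of the residual, and whenever the sign flips, reverse direction and shrink the step tenfold. The starting guess 0.1, the first step 0.01 and the number of rows are assumptions, and step_search is a made-up name.

```python
import math

# Values from the question.
S, V, D, T = 19.634954, 12.271846, 0.01016482, 4.0
CFL_MEASURED = 0.401


def cfl(k):
    """Calculated CFL for a guessed k (column C)."""
    return (S / V) * (D / k) ** 0.5 * (
        math.erf((k * T) ** 0.5) + (k * T / math.pi) ** 0.5 * math.exp(-k * T))


def step_search(k=0.1, step=0.01, sign=1, rows=200):
    """Each loop pass plays the role of one worksheet row."""
    for _ in range(rows):
        k += sign * step                                     # column B
        new_sign = 1 if cfl(k) > CFL_MEASURED else -1        # column D
        if new_sign != sign:
            step /= 10                                       # column E
        sign = new_sign
    return k


k = step_search()
print(k, cfl(k))
```

Each sign flip means the target has just been crossed, so the solution is trapped within the previous step size and the error shrinks by roughly a factor of 10 per flip.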
