Statistics: Prediction of multiple outputs vs. one input - statistics

I have a problem and would like to hear your expert opinion. Let's say you have one input value (call it TotalValue) with a range of 0-10, and there are 5 output values (call them Week1, Week2, Week3, Week4, Week5).
Now, for example, if the input value is 3, it will be distributed over the output values.
So one possible data point would be:
TotalValue: 3 --> Week1: 0, Week2: 1, Week3: 0, Week4: 1, Week5: 1
Another possible data point would be:
TotalValue: 4 --> Week1: 2, Week2: 0, Week3: 2, Week4: 0, Week5: 0
The condition is that the sum of the output values always equals the input value.
Now, given that sufficient data is available, what would be the best way to go about this problem? I'm guessing multi-output regression will not fare well, since we have only one input value.
Thanks all for the help.

Related

Comparing the word-counts of two files, accounting for the number of occurrences

I'm currently working on a program which is supposed to find exploits for vulnerabilities in web-applications by looking at the "Document Object Model" (DOM) of the application.
One approach for narrowing down the number of possible DB-entries follows the strategy of further filtering the entries by comparing the word-count of the DOM and the database entry.
I already have two dicts (actually DataFrames, but shown as dicts here for better presentation), each containing the top 10 words in descending order of their number of occurrences in the text.
word_count_dom = {"Peter": 10, "is": 6, "eating": 2, ...}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1, ...}
Now I would like to calculate some kind of value that represents how similar the two dicts are while accounting for the number of occurrences.
Currently I'm using:
len([word for word in word_count_dom if word in word_count_db])
>>> 3
but this does not account for the number of occurrences at all.
Looking at the example, I would like the program to give more weight to the "is" match, because of its generally better "ranking position to number of occurrences" value.
Just an idea:
Compute, for each dict, the relative frequency of each entry (e.g. among all the top counts, "Peter" occurs 20% of the time). Do this for each word occurring in either dict. Then use something like the Bhattacharyya distance:
https://en.wikipedia.org/wiki/Bhattacharyya_distance
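The idea above could be sketched roughly as follows. This is a minimal illustration, not a tested recommendation: the function names are made up for this example, the dicts contain only the entries shown in the question (the elided ones are left out), and words missing from one dict are assumed to have a count of zero there.

```python
import math

def to_distribution(counts, vocabulary):
    """Turn raw counts into relative frequencies over a shared vocabulary."""
    total = sum(counts.values())
    return {word: counts.get(word, 0) / total for word in vocabulary}

def bhattacharyya_distance(counts_a, counts_b):
    """Bhattacharyya distance between two word-count dicts (0 means identical)."""
    vocabulary = set(counts_a) | set(counts_b)
    p = to_distribution(counts_a, vocabulary)
    q = to_distribution(counts_b, vocabulary)
    # Bhattacharyya coefficient: sum of sqrt(p_i * q_i) over all words.
    coefficient = sum(math.sqrt(p[w] * q[w]) for w in vocabulary)
    if coefficient == 0:
        return math.inf  # no shared words at all
    # Clamp to 1.0 so rounding errors cannot produce a negative distance.
    return -math.log(min(coefficient, 1.0))

# Only the entries visible in the question; the remaining top-10 words are unknown.
word_count_dom = {"Peter": 10, "is": 6, "eating": 2}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1}

print(bhattacharyya_distance(word_count_dom, word_count_db))
```

A smaller distance means the two word distributions are more alike, and a word that is frequent in both dicts (such as "is" here) contributes more to the coefficient than a word that is frequent in only one.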

How can I use simulation tool in Excel for solving the following problem related to probability?

Trial number:            1    2    3    4    5    ...  2000000 (two million)
Success on nth attempt:  12   4    21   5    10   ...  12
Note: Imagine throwing a die where each outcome has a probability of 1/10 (not 1/6, as is usual for dice). For us, "success" means throwing a "3". For each trial (see above) we keep throwing the die until we get a "3". For example, above I assume that during the first trial we threw the die 12 times and got a "3" only on the 12th attempt. The same goes for the other trials. For instance, on the 5th trial we threw the die 10 times and got a "3" only on the 10th attempt.
We need to simulate this 2 million times (or fewer, let's say 500,000 times).
Eventually we need to calculate what percentage of "successes" happen in the interval of 10-20 tries, 1-10 tries, etc.
For example, out of 2,000,000 trials, in 60% of cases (1,200,000) we get a "3" between the 10th and 20th attempts.
I performed a manual simulation, but could not create a simulation model. Can you please help?
This might not be a good solution for a dataset as large as the one you intend to use, and Excel is probably not the most efficient tool for it. Anyway, here is a possible approach.
In cell A1, put the following formula:
=LET(maxNum, 10, trialNum, 5, maxRep, 20, event, 3, cols, SEQUENCE(trialNum,1),
rows, SEQUENCE(maxRep, 1), rArr, RANDARRAY(maxRep, trialNum, 1, maxNum, TRUE),
groupSize, 10, startGroups, SEQUENCE(maxRep/groupSize, 1,,groupSize),
GROUP_PROB, LAMBDA(matrix,start,end, LET(result, BYCOL(matrix, LAMBDA(arr,
LET(idx, IFERROR(XMATCH(event, arr),0), IF(AND(idx >= start, idx <= end), 1, 0)))),
AVERAGE(result))),
HREDUCE, LAMBDA(arr, LET(idx, XMATCH(event, arr),
IF(ISNUMBER(idx), FILTER(arr, rows <= idx),event &" not found"))),
trials, DROP(REDUCE(0, cols, LAMBDA(acc,col, HSTACK(acc,
HREDUCE(INDEX(rArr,,col))))),,1),
dataBlock, VSTACK("Trials", trials),
probBlock, DROP(REDUCE(0, startGroups, LAMBDA(acc,ss,
VSTACK(acc, LET(ee, ss+groupSize-1, HSTACK("%-Group "&ss&"-"&ee,
GROUP_PROB(trials, ss, ee))
))
)),1),
IFERROR(HSTACK(dataBlock, probBlock), "")
)
And here is the output:
Explanation
We use LET for easy reading and composition. We first define the parameters of the experiment:
maxNum, the maximum random number to be generated. The minimum will be 1.
trialNum, the number of trials (in our case, the number of columns).
maxRep, the maximum number of repetitions (in our case, the number of rows).
rows and cols, the row and column index sequences, respectively.
event, our successful event, in our case 3.
groupSize, the size of each group for calculating the probability of each group.
startGroups, the start index position of each group.
rArr, a random array of size maxRep x trialNum. The minimum random number will be 1 and the maximum maxNum. The last input argument of RANDARRAY ensures it generates only integers.
GROUP_PROB is a user LAMBDA function to calculate the probability of our successful event: number 3 was generated.
LAMBDA(matrix,start,end, LET(result, BYCOL(matrix, LAMBDA(arr,
LET(idx, IFERROR(XMATCH(event, arr),0), IF(AND(idx >= start, idx <= end), 1, 0)))),
AVERAGE(result)))
Basically, for each column (arr) of matrix, it finds the index position of the event and checks whether that position falls within the reference interval [start, end]. If so, it returns 1, otherwise 0. Finally, the AVERAGE function calculates the probability. If the event was not generated at all, the column counts as 0 too.
We use the DROP/REDUCE/VSTACK or HSTACK pattern. Please check the answer to the question "how to transform a table in Excel from vertical to horizontal but with different length" provided by #DavidLeal.
HREDUCE is a user LAMBDA function that keeps each column of rArr up to and including the row where the event is first found. If the event was not found, it returns a string indicating that.
The name probBlock builds the array with all the probability groups.
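For comparison, the same experiment could be sketched outside Excel. The following Python snippet is only an illustration under stated assumptions: the function names, the seed, the trial count of 100,000 and the groups 1-10 and 11-20 are choices made for this example, not part of the answer above. It also prints the closed-form value for a geometric distribution, P(a <= N <= b) = (1-p)^(a-1) - (1-p)^b, as a cross-check.

```python
import random

def attempts_until_success(p, rng):
    """Number of throws needed until the first success (success probability p)."""
    attempts = 1
    while rng.random() >= p:
        attempts += 1
    return attempts

def group_shares(num_trials, p, groups, seed=0):
    """Share of trials whose first success falls inside each (start, end) group."""
    rng = random.Random(seed)
    counts = [0] * len(groups)
    for _ in range(num_trials):
        n = attempts_until_success(p, rng)
        for i, (start, end) in enumerate(groups):
            if start <= n <= end:
                counts[i] += 1
                break
    return [count / num_trials for count in counts]

def exact_share(p, start, end):
    """Closed-form probability that the first success occurs in [start, end]."""
    return (1 - p) ** (start - 1) - (1 - p) ** end

groups = [(1, 10), (11, 20)]
shares = group_shares(100_000, 0.1, groups)
for (start, end), share in zip(groups, shares):
    print(f"{start}-{end}: simulated {share:.4f}, exact {exact_share(0.1, start, end):.4f}")
```

Because the closed form is available, the simulation is mainly useful as a sanity check or when the success rule becomes more complicated than a single fixed probability.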

How can I return a minimum value in an Excel worksheet if the cells I am comparing contain values created by a function?

I am trying to return the minimum value of a range of cells, excluding zero, but whenever I use this formula it returns zero instead of the minimum:
Formula: {=MIN(IF(DW2:EE2 = 0,"",DW2:EE2))}
The values in the cells are produced by this formula:
=IF(BJ2="",0,IF(BJ2="D1","1",IF(BJ2="D2","2",IF(BJ2="C3","3",IF(BJ2="C4","4",IF(BJ2="C5","5",IF(BJ2="C6","6",IF(BJ2="P7","7",IF(BJ2="P8","8",IF(BJ2>="F9","9"))))))))))
This answer is wrong, but it might give you an idea.
=IF(MIN(A1:A5)=0,SMALL(A1:A5,2),MIN(A1:A5))
This means the following (I'm always working with a list of values in the cells "A1" to "A5"):
1. Verify the minimum of the list.
2.1. If the minimum equals zero, then take the second smallest value.
2.2. If the minimum does not equal zero, then take that minimum.
However, there is one problem with the implementation of that approach: I expected =Small(range,2) to give the second smallest distinct number of a list. It does so only when the smallest value occurs once, as these examples show:
Range : 0, 1, 2, 3, 4, 5 => Small(Range,2) = 1 => OK
Range : 0, 1, 2, 3, 4, 0 => Small(Range,2) = 0 => NOK
Apparently, Small(,2) just orders the range, and takes the second element, regardless of the fact that it might be equal to Small(,1).
Does anybody know a solution or workaround for this issue?
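To make the difference concrete, here is a small Python sketch of the two behaviours discussed above. It is an illustration only: `kth_smallest` mimics SMALL as described in the examples (duplicates count separately), and `min_nonzero` is a hypothetical helper that discards zeros before taking the minimum. Neither is an Excel formula.

```python
def kth_smallest(values, k):
    """k-th smallest value, counting duplicates separately (like the SMALL examples above)."""
    return sorted(values)[k - 1]

def min_nonzero(values):
    """Smallest value after discarding zeros; None if nothing but zeros is left."""
    nonzero = [value for value in values if value != 0]
    return min(nonzero) if nonzero else None

data = [0, 1, 2, 3, 4, 0]
print(kth_smallest(data, 2))  # 0, because the second zero is the second smallest entry
print(min_nonzero(data))      # 1
```

Filtering the zeros out first, rather than skipping a fixed number of entries, is what makes the result independent of how many zeros the range contains.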

Is there an EXCEL formula that counts how many times Excel data 'switches' between binomial coding (i.e., data coded as 1 or 2)?

Folks, would really appreciate some help;
If I have the following values of data in Excel (3 examples below) where data is coded as 1 or 2, does anyone know an Excel formula which can count how many 'switches' occur in the sequence of values? What I mean by a 'switch', is when the 1's switch to 2's, and vice versa (when the 2's switch to 1's).
For example;
1, 1, 1, 1, 1, 1, 1, 2, 2 (there is 1 switch here)
1, 2, 2, 1, 1, 1, 1, 1, 1 (there are two switches here)
1, 2, 2, 1, 1, 1, 1, 2, 2 (there are three switches here)
So far, I am able to use the following formula (see below) to see IF there is a switch at all in the sequence (from 2 to 1 for example). But now I am trying to calculate how many switches are occurring in the sequence, and not IF a singular switch is occurring. So I think I possibly need to use a COUNT formula, instead of a FIND formula?
=IF(ISERROR(FIND("21",TEXTJOIN("",TRUE,[data range of a row]))),FALSE,TRUE)
Any help is appreciated.
If you have your data in cells A1 to I1, then use this formula:
=SUM(ABS(B1:I1-A1:H1))
I've tested this with your three inputs and it produces the expected answers.
If you have access to FILTERXML (I believe from Excel 2013 onwards), you could use:
Formula in B1:
=COUNT(FILTERXML("<t><s>"&SUBSTITUTE(A1,", ","</s><s>")&"</s></t>","//s[following::*[1]!=.]"))
For your understanding: the XPath expression //s[following::*[1]!=.] returns all (child) nodes whose first following node differs from the current one. COUNT then counts the returned numbers. Note that this will return 0 when no change occurs in your string.
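The logic shared by both formulas, comparing each value with its neighbour and counting the differences, can be sketched in Python as follows. The function name `count_switches` is made up for this illustration.

```python
def count_switches(sequence):
    """Count positions where a value differs from the one immediately before it."""
    return sum(1 for previous, current in zip(sequence, sequence[1:]) if previous != current)

print(count_switches([1, 1, 1, 1, 1, 1, 1, 2, 2]))  # 1
print(count_switches([1, 2, 2, 1, 1, 1, 1, 1, 1]))  # 2
print(count_switches([1, 2, 2, 1, 1, 1, 1, 2, 2]))  # 3
```

Unlike the SUM(ABS(...)) formula, which relies on the codes being exactly 1 apart, this comparison works for any pair of codes.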

How to get the second largest value in a column

Recently I discovered the LARGE and SMALL worksheet functions, which one can use to determine the first, second, third, ... largest or smallest value in an array.
At least, that's what I thought:
For the array [1, 3, 5, 7, 9] (in one column or row), LARGE(...;2) gives 7 as expected, but:
For the array [1, 1, 5, 9, 9], I expect LARGE(...;2) to give 5, but instead I get 9.
Now this makes sense: it seems that LARGE(...;2) takes the largest entry in the array (the value 9 in the second-to-last place), removes it, and returns the largest entry of the reduced array (which still contains another 9). This is not what one might expect intuitively.
In order to get 5 from [1, 1, 5, 9, 9], I would need something like:
=LARGE_OF_UNIQUE_VALUES_OF(...;2)
I didn't find this in LARGE documentation.
Does anybody know an easy way to achieve this?
If you have the new Dynamic Array formulas:
=LARGE(UNIQUE(...),2)
If not use AGGREGATE:
=AGGREGATE(14,7,A1:A5/(MATCH(A1:A5,A1:A5,0)=ROW(A1:A5)),2)
This is a bit of a hack.
=LARGE(IF(YOUR_DATA=LARGE(YOUR_DATA,1),SMALL(YOUR_DATA,1)-1,YOUR_DATA),1)
The idea is to (a) take any value in your data that is equal to the largest element and set it to less than the smallest element, then (b) find the (new) largest element. It's OK if you want the 2nd largest, but extending to 3rd largest etc. gets progressively uglier.
Hope that helps
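The "largest of the unique values" idea behind LARGE(UNIQUE(...),2) can be sketched in Python like this. The function name `kth_largest_unique` is hypothetical and chosen only for this example.

```python
def kth_largest_unique(values, k):
    """k-th largest value after removing duplicates; None if there are fewer than k distinct values."""
    distinct = sorted(set(values), reverse=True)
    return distinct[k - 1] if k <= len(distinct) else None

print(kth_largest_unique([1, 3, 5, 7, 9], 2))  # 7
print(kth_largest_unique([1, 1, 5, 9, 9], 2))  # 5
```

Deduplicating first keeps the approach the same for the 3rd or 4th largest value, which is where the substitution hack above becomes awkward.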
