Beating O(n!) for computing joint distributions - statistics

I have n normal distributions of varying means and standard deviations (some distributions do have the same standard deviations)
My goal is to compute the expected rank of the populations in each distribution.
One approach I came up with was to subtract each distribution from each other one and then find the proportion of the resulting distribution that was > 0. Doing this allowed me to populate an n x n matrix, where P(i>j) denotes the probability that a sample from distribution i is greater than a sample from distribution j (a sketch of this pairwise computation follows the matrix):
0 | P(1>2) | P(1>3) .... P(1>n)
P(2>1) | 0 | P(2>3) .... P(2>n)
....
P(n>1) | P(n>2) | P(n>3) .... 0
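Since X_i - X_j is itself normal with mean mu_i - mu_j and variance sd_i^2 + sd_j^2, each entry P(i>j) is just an upper-tail probability at 0. A minimal sketch of this pairwise computation (the function and variable names are mine, not from the question):

import numpy as np
from scipy.stats import norm

def pairwise_win_matrix(means, sds):
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    diff_mean = means[:, None] - means[None, :]            # mu_i - mu_j
    diff_sd = np.sqrt(sds[:, None]**2 + sds[None, :]**2)   # sd of X_i - X_j
    np.fill_diagonal(diff_sd, 1.0)                         # avoid scale = 0 on the diagonal
    P = norm.sf(0.0, loc=diff_mean, scale=diff_sd)         # P(X_i - X_j > 0)
    np.fill_diagonal(P, 0.0)                               # the matrix above uses 0 on the diagonal
    return P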
But to compute the expected ranks I still had to do the following, where P(dist i, rank r) means the probability that a sample from distribution i has rank r; e.g. P(dist1, rank1) is the probability that a sample from distribution 1 is the first-place winner/highest rank:
P(dist1,rank1) = P(1>2) & ... & P(1>n)
P(dist1,rank2) =
P(1<2) & P(1>3) & ... & P(1>n) |
P(1>2) & P(1<3) & ... & P(1>n) |
...
P(1>2) & ... & P(1>n-1) & P(1<n)
P(dist1, rank3) = ...
P(dist2, rank...)
....
This gives me O(n!) time, and since I routinely have ~150-200 distributions, n! isn't feasible.
I'd like to avoid solutions that involve sampling the distributions if at all possible, but if that isn't possible, any readings/papers that explain why it isn't, and/or offer robust sampling methods, would be helpful.
My desired result is for each of n distributions given m ranks I have the following:
P(dist1, rank1) | P(dist1, rank2) | ... | P(dist1, rankm)
...
P(distn, rank1) | P(distn, rank2) | ... | P(distn, rankm)
Said another way: if I were to sample the distributions, what would the probability be that a given distribution's sample(s) would be ranked highest? Lowest? Or each rank in between?
Assuming I know the true mean and standard deviation for each distribution, how can I compute the probability that a given distribution's sample will be of rank r? (Time complexity comes in because I'm interested in the probabilities for each rank for each distribution.)
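One possible route to that full table, sketched below purely as an illustration (the approach and all names are mine, not from the question): condition on the value x drawn from distribution i. Given x, the events {X_j > x} for the other distributions are independent, so the number of competitors that beat x follows a Poisson-binomial distribution, which a simple dynamic program computes without enumerating orderings; integrating over x weighted by distribution i's density then gives P(dist i, rank r) for every r.

import numpy as np
from scipy.stats import norm

def rank_probabilities(means, sds, grid_size=2001):
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    n = len(means)
    x = np.linspace((means - 6 * sds).min(), (means + 6 * sds).max(), grid_size)
    sf = norm.sf(x[None, :], loc=means[:, None], scale=sds[:, None])   # P(X_j > x)
    ranks = np.zeros((n, n))
    for i in range(n):
        # Poisson-binomial DP: counts[k] = P(exactly k of the other distributions exceed x)
        counts = np.zeros((n, grid_size))
        counts[0] = 1.0
        for j in range(n):
            if j == i:
                continue
            p = sf[j]
            counts[1:] = counts[1:] * (1 - p) + counts[:-1] * p
            counts[0] *= (1 - p)
        pdf_i = norm.pdf(x, loc=means[i], scale=sds[i])
        # rank r (1 = highest) means exactly r-1 others exceed the sample
        ranks[i] = np.trapz(counts * pdf_i, x, axis=1)
    return ranks   # ranks[i, r-1] = P(dist i, rank r)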

Related

How do I sum up a score based on 2 values?

I have this Excel table:
Status | Priority |
-------------------
Yes | High |
No | Medium |
N/A | Medium |
Yes | Low |
A bit | Bonus |
| |
| |
Each priority has a point value. The priority point values can change to anything, and they aren't in any particular order. Note that rows can also be blank. Assume that if Priority is blank then Status is also blank.
High = 3 points
Medium = 2 Points
Low = 1 Point
Bonus = 1 Point
Statuses can be blank or any value. However, if they are one of the following then they have conditions:
Yes = full points (eg. Yes with High priority gives 3 points, or Yes with Bonus gives 1 point).
A bit = half points (eg. A bit with High priority gives 1.5 points, or A bit with Medium gives 1 point). Essentially halving the point value.
If the Status is Yes or A bit, I want it to count the corresponding point value. So for the table above it should total 4.5 points.
3 Points for Row 2
1 Point for Row 5
0.5 points for Row 6
I was wondering how I can do this?
I was going to do the following, but it only has one condition.
=COUNTIF(A2:A5, "Yes")
Using Tables and Named Ranges with structured references gives you a great deal of flexibility.
I first set up two tables
priorityTbl
statusTbl
With our input data, I created two Named Ranges, Status and Priority.
The total is then given by the formula:
=SUMPRODUCT(IFERROR(INDEX(statusTbl,MATCH(Status,statusTbl[Status],0),2),0),
IFERROR(INDEX(priorityTbl,MATCH(Priority,priorityTbl[Priority],0),2),0))
If you want to change the values you assign to the different Priority/Status items, you merely change them in the table.
You could also add new rows to the tables, if that is appropriate.
Note that I did not bother adding to the tables rows where the value might be zero, but you could if you wanted to.
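For anyone who wants to sanity-check the scoring logic outside Excel, here is a small Python cross-check (the dictionaries and row data are just a transcription of the question, not part of the workbook):

priority_points = {"High": 3, "Medium": 2, "Low": 1, "Bonus": 1}
status_factor = {"Yes": 1.0, "A bit": 0.5}   # any other status scores 0

rows = [("Yes", "High"), ("No", "Medium"), ("N/A", "Medium"),
        ("Yes", "Low"), ("A bit", "Bonus"), ("", ""), ("", "")]

total = sum(status_factor.get(status, 0) * priority_points.get(priority, 0)
            for status, priority in rows)
print(total)   # 4.5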

How to find the overlapping subproblems in the coin change problem? In this recursive code I can't find one

You are working at the cash counter at a fun-fair, and you have different types of coins available to you in infinite quantities. The value of each coin is already given. Can you determine the number of ways of making change for a particular number of units using the given types of coins?
counter = 0

def helper(n, c):
    # n is the amount whose change is to be made
    # c is the list of available coin values
    global counter
    if n == 0:
        counter += 1              # found one complete way to make change
        return
    if len(c) == 0:
        return
    else:
        if n >= c[0]:
            helper(n - c[0], c)   # use the first coin (and allow reusing it)
        helper(n, c[1:])          # skip the first coin entirely

def getWays(n, c):
    helper(n, c)
    print(counter)
    return counter
Let n be the amount of currency units to return as change. You wish to find N(n), the number of possible ways to return change.
One easy solution would be to first choose the "first" coin you give (let's say it has value c), then notice that N(n) is the sum of all the values N(n-c) for every possible c. Since this appears to be a recursive problem, we need some base cases. Typically, we'll have N(1) = 1 (one coin of value one).
Let's do an example: 3 can be returned as "1 plus 1 plus 1" or as "2 plus 1" (assuming coins of value one and two exist). Therefore, N(3)=2.
However, if we apply the previous algorithm, it will compute N(3) to be 3.
+------------+-------------+------------+
| First coin | Second coin | Third coin |
+------------+-------------+------------+
|     2      |      1      |            |
+------------+-------------+------------+
|            |      2      |            |
|     1      +-------------+------------+
|            |      1      |      1     |
+------------+-------------+------------+
Indeed, notice that returning 3 units as "2 plus 1" or as "1 plus 2" is counted as two different solutions by our algorithm, whereas they are the same.
We therefore need to apply an additional restriction to avoid such duplicates. One possible solution is to order the coins (for example by decreasing value). We then impose the following restriction: if at a given step we returned a coin of value c0, then at the next step, we may only return coins of value c0 or less.
This leads to the following induction relation (noting c0 the value of the coin returned in the last step): N(n) is the sum of all the values of N(n-c) for all possible values of c less than or equal to c0.
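A minimal Python sketch of an equivalent formulation (the names here are mine, not the poster's): instead of tracking the value of the last coin returned, track how many coin types are still allowed. This enforces the same ordering restriction and exposes the overlapping subproblems, so memoization applies directly.

from functools import lru_cache

def count_ways(n, coins):
    coins = tuple(coins)

    @lru_cache(maxsize=None)
    def ways(amount, k):
        # number of ways to make `amount` using only the first k coin types
        if amount == 0:
            return 1              # one way: give no more coins
        if amount < 0 or k == 0:
            return 0              # overshot the amount, or no coin types left
        # either skip coin k-1, or use it (and stay allowed to use it again)
        return ways(amount, k - 1) + ways(amount - coins[k - 1], k)

    return ways(n, len(coins))

print(count_ways(3, [1, 2]))      # 2, i.e. "1+1+1" and "2+1"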
Happy coding :)

Why scikit learn confusion matrix is reversed?

I have 3 questions:
1)
The confusion matrix for sklearn is as follows:
TN | FP
FN | TP
Whereas when I look at online resources, I find it like this:
TP | FP
FN | TN
Which one should I consider?
2)
Since the above confusion matrix for scikit-learn is different from the one I find in other resources, what will the structure be for a multiclass confusion matrix? I'm looking at this post here:
Scikit-learn: How to obtain True Positive, True Negative, False Positive and False Negative
In that post, #lucidv01d posted a graph to help understand the categories for the multiclass case. Is that layout the same in scikit-learn?
3)
How do you calculate the accuracy for multiclass? For example, I have this confusion matrix:
[[27 6 0 16]
[ 5 18 0 21]
[ 1 3 6 9]
[ 0 0 0 48]]
In that same post I referred to in question 2, he has written this equation:
Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)
but isn't that just for binary classification? I mean, which class do I compute TP for?
The reason why sklearn shows its confusion matrix like
TN | FP
FN | TP
is because in their code, they have considered 0 to be the negative class and 1 to be the positive class. sklearn always considers the smaller label to be the negative class and the larger label to be the positive class. By label, I mean the class value (0 or 1). The order depends on your dataset and classes.
The accuracy will be the sum of the diagonal elements divided by the sum of all the elements. The diagonal elements are the number of correct predictions.
As the sklearn guide says: "(Wikipedia and other references may use a different convention for axes)"
What does it mean? When building the confusion matrix, the first step is to decide where to put predictions and real values (true labels). There are two possibilities:
put predictions in the columns, and true labels in the rows
put predictions in the rows, and true labels in the columns
It is totally subjective to decide which way you want to go. From this picture, explained here, it is clear that scikit-learn's convention is to put predictions in columns, and true labels in rows.
Thus, according to scikit-learn's convention:
the first column contains negative predictions (TN and FN)
the second column contains positive predictions (TP and FP)
the first row contains negative labels (TN and FP)
the second row contains positive labels (TP and FN)
the diagonal contains the number of correctly predicted labels.
Based on this information I think you will be able to solve part 1 and part 2 of your questions.
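For the binary case, a quick way to confirm this layout in code (a small sketch with made-up labels, not from the original post):

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0]
# sklearn's layout: rows = true labels, columns = predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 1 1 1 2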
For part 3, you just sum the values in the diagonal and divide by the sum of all elements, which will be
(27 + 18 + 6 + 48) / (27 + 18 + 6 + 48 + 6 + 16 + 5 + 21 + 1 + 3 + 9)
or you can just use the score() function.
The scikit-learn convention is to place predictions in columns and real values (true labels) in rows.
The scikit-learn convention is to put 0 by default for the negative class (top) and 1 for the positive class (bottom). The order can be changed using labels=[1,0].
You can calculate the overall accuracy in this way:

import numpy as np

M = np.array([[27, 6, 0, 16],
              [5, 18, 0, 21],
              [1, 3, 6, 9],
              [0, 0, 0, 48]])

# sum of the diagonal (the correct predictions)
w = M.diagonal()
w.sum()          # 99

# sum of all the elements of the matrix
M.sum()          # 160

ACC = w.sum() / M.sum()
ACC              # 0.61875

kullback leibler divergence limit

For distributions over N values, how can I efficiently upper-bound the largest KL divergence between any two strictly positive distributions over the same sample space? For example, take all distributions of a random variable with values in {1, 2, 3, 4}, i.e., N = 4, where the probability of a = 1, a = 2, a = 3, or a = 4 is always nonzero (but can be very small, e.g., 1e-1000).
Is there a known bound (other than infinity)? Say, given the number N, is the divergence between the uniform distribution [1/4, 1/4, 1/4, 1/4] and the near-"delta" distribution [1e-10, 1e-10, 1e-10, 1/(1+3e-10)] the largest?...
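As a purely illustrative numeric check of that example (my own sketch, not part of the original post): with a probability floor eps on every outcome, KL(uniform || near-delta) stays finite and grows roughly like log(1/eps) as eps shrinks, which is why some floor is needed for any bound short of infinity.

import numpy as np
from scipy.stats import entropy   # entropy(p, q) computes KL(p || q) in nats

for eps in (1e-2, 1e-5, 1e-10):
    uniform = np.array([0.25, 0.25, 0.25, 0.25])
    near_delta = np.array([eps, eps, eps, 1 - 3 * eps])
    print(eps, entropy(uniform, near_delta), entropy(near_delta, uniform))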
Thanks all in advance,
A.

Rating the straightness of a line

I have a data set that defines a set of points on a 2-dimensional Cartesian plane. Theoretically, those points should form a line, but that line may be perfectly horizontal, perfectly vertical, and anything in between.
I would like to design an algorithm that rates the 'straightness' of that line.
For example, the following data sets would be perfectly straight:
Y = 2/3x + 4
X | Y
---------
-3 | 2
0 | 4
3 | 6
Y = 4
X | Y
---------
1 | 4
2 | 4
3 | 4
X = -1
X | Y
---------
-1 | 7
-1 | 8
-1 | 9
While this one would not:
X | Y
---------
-3 | 2
0 | 5
3 | 6
I think it would work to minimize the sum of the squares of the distances of each point to a line (usually called a regression line), then determine the average distance of each point to the line. Thus, a perfectly straight line would have an average distance of 0.
Because the data can represent a line that is vertical, as I understand it, the usual least-squares regression line won't work for this data set. A perpendicular least-squares regression line might work, but I've had little luck finding an implementation of one.
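Here is a short sketch (mine, not from the thread) of that perpendicular least-squares idea in Python: centre the points and take the smallest singular value, which is the root of the sum of squared perpendicular distances to the best-fit line, so vertical lines are handled the same as any other direction.

import numpy as np

def straightness(points):
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    # smallest singular value = sqrt(sum of squared perpendicular distances)
    s = np.linalg.svd(centered, compute_uv=False)
    return s[-1] / np.sqrt(len(pts))     # RMS perpendicular distance; 0 = perfectly straight

print(straightness([(-3, 2), (0, 4), (3, 6)]))    # ~0  (y = 2/3x + 4)
print(straightness([(-1, 7), (-1, 8), (-1, 9)]))  # ~0  (vertical line x = -1)
print(straightness([(-3, 2), (0, 5), (3, 6)]))    # > 0 (not straight)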
I am working in Excel 2010 VBA, but I should be able to translate any reasonable algorithm.
Thanks,
PaulH
The reason things like RSQ and LINEST won't work for this is that I need a universal measurement that includes vertical lines. As a line's slope approaches infinity (vertical), its RSQ approaches 0 even if the line is perfectly straight or nearly so.
-PaulH
Sounds like you are looking for R^2, the coefficient of determination.
Basically, you take the residual sum of squares, divide by the total sum of squares, and subtract from 1.
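A minimal illustration of that formula (the variable names are mine; note that, as the asker points out, an ordinary fit like this breaks down for vertical lines where all x-values are equal):

import numpy as np

def r_squared(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)        # ordinary least-squares fit
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)               # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)          # total sum of squares
    return 1 - ss_res / ss_tot

print(r_squared([-3, 0, 3], [2, 4, 6]))   # 1.0 for the perfectly straight example above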
Use a linear regression. The "straightness" of the line is the R^2 value.
A value of 1 for R^2 implies the points fall exactly on the regression line. Decreasing values imply increasing error in the regression, and thus the line is less and less "straight".
Could you try to catch the case of the vertical line before running the least-squares regression? If all x-values are the same, then the line is perfectly straight and there is no need to calculate an R^2 value.
Rough idea:
1. translate all coordinates to absolute values
2. calculate tan of current x/y
3. calculate tan of difference in x/y between current x/y and next x/y
4. difference in tan can give running deviation
Yes, use the ordinary least-squares method. Just use the SLOPE and INTERCEPT functions in a worksheet. I expect there is a simple way to call these from the VBA code-behind.
Here's the VBA info. for R-Squared: http://www.pcreview.co.uk/forums/thread-1009945.php
