Kullback-Leibler divergence limit - statistics

For distributions over N values, how can I efficiently upper-bound the largest KL divergence between any two distributions with strictly positive probabilities on the same support? For example, consider all distributions of a random variable that takes values in {1, 2, 3, 4}, i.e., N = 4, where the probability of each outcome a = 1, 2, 3, 4 is always nonzero (but can be very small, e.g., 1e-1000).
Is there a known bound (other than infinity)? Say, given the number N, is the divergence between the uniform distribution [1/4 1/4 1/4 1/4] and a near-delta distribution such as [1e-10 1e-10 1e-10 1/(1+3e-10)] the largest?
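For a quick numerical illustration of the example in the question (a sketch; assumes numpy is available):
import numpy as np

p = np.full(4, 0.25)                                  # uniform over N = 4 outcomes
q = np.array([1e-10, 1e-10, 1e-10, 1/(1 + 3e-10)])    # "near-delta", all entries still positive
kl = np.sum(p * np.log(p / q))                        # D_KL(P || Q) in nats
print(kl)                                             # ~15.9; grows without bound as the small entries shrink toward 0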
Thanks all in advance,
A.

Related

What's the difference between these two methods for calculating a weighted median?

I'm trying to calculate a weighted median, but don't understand the difference between the following two methods. The answer I get from weighted.median() is different from the one I get from with(df, median(rep(value, count))), but I don't understand why. Are there many ways to get a weighted median? Is one preferable over the other?
df = read.table(text="row count value
1 1. 25.
2 2. 26.
3 3. 30.
4 2. 32.
5 1. 39.", header=TRUE)
# weighted median
with(df, median(rep(value, count)))
# [1] 30
library(spatstat)
weighted.median(df$value, df$count)
# [1] 28
Note that with(df, median(rep(value, count))) only makes sense for weights that are positive integers (rep will accept float values for count but will coerce them to integers), so this approach is not a fully general way of computing weighted medians. ?weighted.median shows that the function tries to compute a value m such that the total weight of the data below m is 50% of the total weight. In your sample there is no m that works exactly: 28.5% of the total weight of the data is <= 26 and 61.9% is <= 30. In a case like this, by default ("type 2") it averages these two values to get the 28 that is returned. There are two other types: weighted.median(df$value, df$count, type = 1) returns 30. I am not completely sure whether this type will always agree with your other approach.
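For comparison, here is a rough Python analogue of the rep()-based approach (a sketch; assumes numpy), which only works because the counts here happen to be positive integers:
import numpy as np

values = np.array([25, 26, 30, 32, 39])
counts = np.array([1, 2, 3, 2, 1])

# Expand each value by its integer count, then take an ordinary median.
print(np.median(np.repeat(values, counts)))   # 30.0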

Binary Search Complexity

What is the time complexity of binary search when it takes an array of n elements as input from the user?
The time complexity of binary search itself is O(log n), whereas the time complexity of reading the array from the user is O(n).
For the whole program the answer is O(log n) + O(n) = O(n).
Why? Because reading the n numbers into an array and running binary search are independent of each other, so their costs simply add, and the linear O(n) term dominates.
Now we need to understand time complexity in general:
In the simplest terms, for a problem where the input size is n:
Best case = fastest time to complete, with optimal inputs chosen.
For example, the best case for a sorting algorithm would be data that's already sorted.
Worst case = slowest time to complete, with pessimal inputs chosen.
For example, the worst case for a sorting algorithm might be data that are sorted in reverse order (but it depends on the particular algorithm).
Average case = arithmetic mean. Run the algorithm many times, using many different inputs of size n that come from some distribution that generates these inputs (in the simplest case, all the possible inputs are equally likely), and compute the total running time (by adding the individual times), and divide by the number of trials. You may also need to normalize the results based on the size of the input sets.
Example (Binary search):
Suppose we have the following binary search function:
binarySearch(arr, x, low, high)
    while (low <= high)
        mid = (low + high) / 2
        if (x == arr[mid])
            return mid
        else if (x > arr[mid])   // x is in the right half
            low = mid + 1
        else                     // x is in the left half
            high = mid - 1
    return -1                    // x is not present
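For reference, a direct Python translation of this pseudocode (a sketch):
def binary_search(arr, x):
    # Return the index of x in the sorted list arr, or -1 if x is absent.
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == x:
            return mid
        elif x > arr[mid]:        # x is in the right half
            low = mid + 1
        else:                     # x is in the left half
            high = mid - 1
    return -1

print(binary_search([2, 3, 5, 7, 11, 13], 7))   # 3
print(binary_search([2, 3, 5, 7, 11, 13], 6))   # -1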
Now, let's analyze its time complexity.
Best Case Time Complexity of Binary Search
The best case of Binary Search occurs when:
The element to be searched is in the middle of the list
In this case, the element is found in the first step itself and this involves 1 comparison.
Therefore, Best Case Time Complexity of Binary Search is O(1).
Average Case Time Complexity of Binary Search
Let the input be N distinct numbers: a1, a2, ..., a(N-1), aN
We need to find element P.
There are two cases:
Case 1: The element P can be in N distinct indexes from 0 to N-1.
Case 2: There will be a case when the element P is not present in the list.
There are N possibilities under case 1 and 1 possibility under case 2, so there are N+1 distinct cases to consider in total.
If Binary Search finds element P on its K-th probe, it has performed K comparisons.
This is because:
The element at index N/2 can be found in 1 comparison, as Binary Search starts from the middle.
Similarly, in the 2nd comparison, the elements at indexes N/4 and 3N/4 are compared, based on the result of the 1st comparison.
Along the same lines, in the 3rd comparison, the elements at indexes N/8, 3N/8, 5N/8, and 7N/8 are compared, based on the result of the 2nd comparison.
Based on this, we know that:
Elements requiring 1 comparison: 1
Elements requiring 2 comparisons: 2
Elements requiring 3 comparisons: 4
Therefore, Elements requiring I comparisons: 2^(I-1)
The maximum number of comparisons = the number of times N can be halved until the result is 1 = the number of comparisons needed to narrow the range down to a single element = logN comparisons
I can vary from 1 to logN
Total number of comparisons = 1 * (Elements requiring 1 comparison) + 2 * (Elements requiring 2 comparisons) + ... + logN * (Elements requiring logN comparisons)
Total number of comparisons = 1 * (1) + 2 * (2) + 3 * (4) + ... + logN * (2^(logN-1))
Total number of comparisons = 1 + 4 + 12 + 32 + ... = 2^logN * (logN - 1) + 1
Total number of comparisons = N * (logN - 1) + 1
Total number of cases = N+1
Therefore, average number of comparisons = ( N * (logN - 1) + 1 ) / (N+1)
Average number of comparisons = N * logN / (N+1) - N/(N+1) + 1/(N+1)
The dominant term is N * logN / (N+1), which is approximately logN. Therefore, the Average Case Time Complexity of Binary Search is O(logN).
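A quick empirical check of this result (a sketch; it counts one comparison per probed element, as in the analysis above):
import math

def comparisons(arr, x):
    # Count how many elements binary search probes before finding x (or giving up).
    low, high, count = 0, len(arr) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        count += 1
        if arr[mid] == x:
            return count
        elif x > arr[mid]:
            low = mid + 1
        else:
            high = mid - 1
    return count

N = 1024
arr = list(range(N))
avg = sum(comparisons(arr, x) for x in arr) / N
print(avg, math.log2(N))   # roughly 9.0 vs 10.0, i.e. the average grows as O(logN)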
Worst Case Time Complexity of Binary Search
The worst case of Binary Search occurs when:
The element to be searched is at the first or last index, or is not present in the list at all
In this case, the total number of comparisons required is logN.
Therefore, the Worst Case Time Complexity of Binary Search is O(logN).

Why is the scikit-learn confusion matrix reversed?

I have 3 questions:
1)
The confusion matrix for sklearn is as follows:
TN | FP
FN | TP
Whereas when I look at online resources, I find it like this:
TP | FP
FN | TN
Which one should I consider?
2)
Since the above confusion matrix for scikit-learn is different from the one I find in other resources, what will the structure of a multiclass confusion matrix be? I'm looking at this post:
Scikit-learn: How to obtain True Positive, True Negative, False Positive and False Negative
In that post, #lucidv01d posted a graph for understanding the categories in the multiclass case. Is that categorization the same in scikit-learn?
3)
How do you calculate the accuracy of a multiclass classifier? For example, I have this confusion matrix:
[[27 6 0 16]
[ 5 18 0 21]
[ 1 3 6 9]
[ 0 0 0 48]]
In that same post I referred to in question 2, he has written this equation:
Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)
but isn't that just for the binary case? I mean, which class's TP do I plug in?
The reason why sklearn shows its confusion matrix as
TN | FP
FN | TP
is that in its code, 0 is considered the negative class and 1 the positive class. sklearn always treats the smaller label as negative and the larger label as positive. By label, I mean the class value (0 or 1). The order depends on your dataset and classes.
The accuracy is the sum of the diagonal elements divided by the sum of all the elements; the diagonal elements are the counts of correct predictions.
As the sklearn guide says: "(Wikipedia and other references may use a different convention for axes)"
What does it mean? When building the confusion matrix, the first step is to decide where to put predictions and real values (true labels). There are two possibilities:
put predictions in the columns and true labels in the rows
put predictions in the rows and true labels in the columns
It is entirely a matter of convention which way you go. From this picture, explained here, it is clear that scikit-learn's convention is to put predictions in columns and true labels in rows.
Thus, according to scikit-learn's convention:
the first column contains negative predictions (TN and FN)
the second column contains positive predictions (TP and FP)
the first row contains negative labels (TN and FP)
the second row contains positive labels (TP and FN)
the diagonal contains the number of correctly predicted labels.
Based on this information I think you will be able to solve part 1 and part 2 of your questions.
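Here is a minimal sketch of that convention on a toy binary example (assumes scikit-learn; the labels and predictions are made up):
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows = true labels, columns = predictions
print(cm)                               # [[2 1]
                                        #  [1 2]]
tn, fp, fn, tp = cm.ravel()             # TN=2, FP=1, FN=1, TP=2
# Passing labels=[1, 0] reverses the order, putting the positive class first:
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))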
For part 3, you just sum the values in the diagonal and divide by the sum of all elements, which will be
(27 + 18 + 6 + 48) / (27 + 18 + 6 + 48 + 6 + 16 + 5 + 21 + 1 + 3 + 9)
or you can just use the score() function.
The scikit-learn convention is to place predictions in columns and real values in rows
The scikit-learn convention is to put 0 by default for the negative class (top) and 1 for the positive class (bottom). The order can be changed using labels = [1, 0].
You can calculate the overall accuracy in this way:
import numpy as np

M = np.array([[27, 6, 0, 16],
              [5, 18, 0, 21],
              [1, 3, 6, 9],
              [0, 0, 0, 48]])

# sum of the diagonal (correct predictions)
w = M.diagonal()
w.sum()     # 99

# sum of all elements
M.sum()     # 160

ACC = w.sum() / M.sum()
ACC         # 0.61875

Analyse runtime of my algorithm

I am working on creating some algorithms for a course, both of which are for the vertex cover problem.
For the first part I created an algorithm that does the work via brute force: it creates every possible combination of vertices, removes the sets that are not covers, then analyses them. I already have the size for this one.
The second part is the same brute force with an added heuristic, where I eliminate the lower portion of combos that are unlikely to make a cover, based on the number of edges.
Since both of these do work proportional to the sum of the sizes of all the combos, I need to understand the total number of base elements in that list.
The graphs are randomly generated with integers for vertices and edges created randomly from pairs of vertices.
combos = []
vertices = [1, 2, 3,...]
edges = [(1, 2), (2, 3),...]
E = len(edges)
V = len(vertices)
Brute force
import itertools

for x in range(1, V + 1):
    for subset in itertools.combinations(vertices, x):
        combos.append(subset)

# total number of base elements across all combos
total = 0
for i in combos:
    for j in i:
        total += 1
The sum for brute force is: sum over k = 1..V of k * C(V, k) = V * 2^(V-1).
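A quick sanity check of that closed form (a sketch; math.comb needs Python 3.8+):
import math

V = 5
total = sum(k * math.comb(V, k) for k in range(1, V + 1))
print(total, V * 2 ** (V - 1))   # 80 80, matching the V = 5 row in the table below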
Heuristic:
from math import ceil

combos = []      # start from a fresh list for the heuristic run
for x in range(ceil((V ** 2) / E), V + 1):
    for subset in itertools.combinations(vertices, x):
        combos.append(subset)

# total number of base elements across all combos
total = 0
for i in combos:
    for j in i:
        total += 1
The sum as I thought it would end up being:
However, my test runs do not match up for the heuristic, while they do match up for brute force.
Sample runs:
V E Brute Heuristic
5 10 80 25
6 11 192 36
7 17 448 294
8 23 1024 792
9 25 2304 1467
10 36 5120 4660
OK, so I was getting my math formula wrong: the heuristic's count is not the sum over all of V minus the sum over the lower end. It is the sum taken from the lower end of the range up to V, i.e. sum over k = L..V of k * C(V, k), where L is the starting value of x in the heuristic's loop.
I do not know exactly why this works and my original formula does not, since logically the original is just an expanded version of this one.

How do I use Cosine similarity for this use case?

If I have a query vector A and an item vector B, it would be great if someone could guide me on how to weight/normalize the vectors (and what strategies exist for doing so).
Vector A would have the following components: property1 (binary), property2 (binary), property3 (int in the range 0 to 50), property4 (int in the range 0 to 10).
Vector B would have the same properties.
I know that the angle between these 2 vectors, via cosine similarity, would give me the distance between them. I want to create a recommendation based on the similarity.
But I am not clear on how to normalize the properties and/or the vectors in this case, since the components mix binary values and integer ranges. Also, if I want to give one property a higher weight than another, how do I do so? What options do I have?
I find examples of cosine similarity online that use documents, but in this case vectors A and B are not documents, so I am not using TF-IDF.
Please advise,
Thanks
If you want to use the traditional tf-idf-style cosine similarity between the two vectors, then each term is a dimension in your vector. That is, you need to form two new vectors A' and B' and compute the similarity between these two.
These vectors have a dimension for each term, and you have 66 terms:
property 1: true and false
property 2: true and false
property 3: 0 through 50
property 4: 0 through 10
So A' and B' will be vectors of length 66, and each element will be either 0 or 1:
A'(0) = 1 if A(0) = true, and 0 otherwise
A'(1) = 1 if A(0) = false, and 0 otherwise
etc.
Clearly, this is inefficient. You don't actually need to calculate A' or B' to use this kind of cosine similarity; you can just pretend you calculated them and perform the calculation on A and B. Note that length(A') = length(B') = sqrt(4) = 2, because there will be exactly 4 ones in each of A' and B'.
This term-based approach may not be your best bet, though, if you want to capture similarities within properties 3 and 4. That is, with it, a property3 value of 40 is different from a property3 value of 41 and different from a property3 value of 12. Even though 41 is numerically much closer to 40 than 12 is, none of them is considered "farther away" than any other; they are all just different terms.
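Here is a small sketch of that one-hot view (the 66-dimensional layout and the property values are illustrative assumptions):
import numpy as np

def one_hot(p1, p2, p3, p4):
    # p1, p2 booleans; p3 in 0..50; p4 in 0..10 -> a 0/1 vector of length 66
    x = np.zeros(2 + 2 + 51 + 11)
    x[0 if p1 else 1] = 1
    x[2 if p2 else 3] = 1
    x[4 + p3] = 1
    x[4 + 51 + p4] = 1
    return x

A = one_hot(True, False, 40, 3)
B = one_hot(True, True, 41, 3)
cos = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos)   # 0.5 -- two of the four "terms" match, and 41 vs 40 counts as a full mismatch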
So, if you want properties 3 and 4 to incorporate a notion of distance (1 is really close to 2 and 50 is far from 2), then you have to define a distance metric. And if you want to weight the Boolean values more or less heavily than properties 3 and 4, you will have to fold that into the distance metric too. If these are things you want to do, forget about cosine and just come up with a value.
Here's an example:
distance = abs(A.property1 - B.property1) * 5 +
abs(A.property2 - B.property2) * 5 +
abs(A.property3 - B.property3) / 51 * 1 +
abs(A.property4 - B.property4) / 10 * 2
And then the similarity = (the maximum of all distances) - distance;
Or, if you like, similarity = 1 / distance.
You can really define it however you like. And if you need the similarity to be between 0 and 1, normalize by dividing by the maximum possible distance.
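Here is a minimal Python sketch of this weighted-distance idea, following the example formula above (the property values, dictionary keys, and the choice of normalization are illustrative):
def distance(a, b):
    # binary properties weighted 5x; integer properties scaled down and weighted 1x / 2x
    return (abs(a["p1"] - b["p1"]) * 5
            + abs(a["p2"] - b["p2"]) * 5
            + abs(a["p3"] - b["p3"]) / 51 * 1
            + abs(a["p4"] - b["p4"]) / 10 * 2)

query = {"p1": 1, "p2": 0, "p3": 40, "p4": 3}
item  = {"p1": 1, "p2": 1, "p3": 12, "p4": 7}

max_distance = 5 + 5 + 50 / 51 + 2                       # largest value distance() can take
similarity = 1 - distance(query, item) / max_distance    # normalized to [0, 1]
print(similarity)                                        # ~0.51 for these illustrative values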
