Excel Sumproduct counts rows incorrectly with multiple column criteria

I am trying to calculate Precision and Recall for the output of a model that makes 3 predictions of whether a person's name is John or not.
The ground truth for each entry/row is column A. The predictions are stored in columns O,Q and S. The model only needs 2 out of 3 predictions to be > 50% each to be considered correct.
Therefore a True Positive is when >=2 of O,Q,S are > 50%.
Similarly, a False Negative is when < 2 of O,Q,S are > 50%.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
I can calculate precision fine because the final logical operator is >=, so rows whose product is 0 are never counted. But for recall, the final SUM in the denominator is problematic and counts all rows.
This is the part that works, the Precision:
SUM(IF((A2:A300="John")*((O2:O300>=.5)+(Q2:Q300>=.5)+(S2:S300>=.5))>=2,1)) / (SUM(IF((A2:A300="John")*((O2:O300>=.5)+(Q2:Q300>=.5)+(S2:S300>=.5))>=2,1)) + (SUM(IF((A2:A300="Not John")*((O2:O300>=.5)+(Q2:Q300>=.5)+(S2:S300>=.5))>=2,1)))
And this is what I'm trying for recall, but it doesn't work. The last < operator screws up the denominator and I can't figure out how to fix it:
SUM(IF((A2:A300="John")*((O2:O300>=.5)+(Q2:Q300>=.5)+(S2:S300>=.5))>=2,1)) / (SUM(IF((A2:A300="John")*((O2:O300>=.5)+(Q2:Q300>=.5)+(S2:S300>=.5))>=2,1)) + (SUM(IF((A2:A300="John")*((O2:O300>=.5)+(Q2:Q300>=.5)+(S2:S300>=.5))<2,1)))
If there are 3 rows where A = "John", of which only 2 rows have 2 of O,Q,S > 50%,
And there are 3 rows where A = "Not John", all 3 of which have 2 of O,Q,S > 50%,
Then,
Precision = 2 / (2 + 3) = 2/5
Recall = 2 / (2 + 1) = 2/3

You can fix the recall by putting an extra set of brackets round the <2 comparison:
=SUM(IF((A2:A7="John")*((O2:O7>=0.5)+(Q2:Q7>=0.5)+(S2:S7>=0.5))>=2,1))/(SUM(IF((A2:A7="John")*((O2:O7>=0.5)+(Q2:Q7>=0.5)+(S2:S7>=0.5))>=2,1))+(SUM(IF((A2:A7="John")*(((O2:O7>=0.5)+(Q2:Q7>=0.5)+(S2:S7>=0.5))<2),1))))
so you avoid multiplying by zero for the 'Not John' rows before doing the comparison and get the correct denominator.
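For intuition, here is a minimal Python sketch of the same counting logic (the rows below are made-up values that merely reproduce the example counts, not the real data). Note that the vote count is grouped before the < 2 comparison, which is exactly what the extra brackets achieve in the Excel formula:
# made-up rows: 3 "John" rows (2 of them with >=2 predictions above 0.5)
# and 3 "Not John" rows (all with >=2 predictions above 0.5)
rows = [
    ("John",     0.9, 0.8, 0.1),
    ("John",     0.7, 0.6, 0.2),
    ("John",     0.1, 0.2, 0.3),
    ("Not John", 0.9, 0.9, 0.9),
    ("Not John", 0.8, 0.7, 0.6),
    ("Not John", 0.6, 0.9, 0.2),
]

def votes(o, q, s):
    # number of the three predictions that are >= 0.5
    return (o >= 0.5) + (q >= 0.5) + (s >= 0.5)

tp = sum(1 for a, o, q, s in rows if a == "John" and votes(o, q, s) >= 2)
fp = sum(1 for a, o, q, s in rows if a == "Not John" and votes(o, q, s) >= 2)
fn = sum(1 for a, o, q, s in rows if a == "John" and votes(o, q, s) < 2)

print(tp / (tp + fp))  # precision: 2 / (2 + 3) = 0.4
print(tp / (tp + fn))  # recall:    2 / (2 + 1) = 0.667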

Related

Binary Search Complexity

What is the time complexity of binary search when taking an array of n elements as input from the user?
The time complexity of binary search is O(log n),
whereas the time complexity of reading the array from the user is O(n).
For the whole program the answer is: O(log n) + O(n) = O(n)
Why is that? Because reading the n numbers into the array and the binary search itself are independent of each other.
Now we need to understand time complexity in general:
In the simplest terms, for a problem where the input size is n:
Best case = fastest time to complete, with optimal inputs chosen.
For example, the best case for a sorting algorithm would be data that's already sorted.
Worst case = slowest time to complete, with pessimal inputs chosen.
For example, the worst case for a sorting algorithm might be data that are sorted in reverse order (but it depends on the particular algorithm).
Average case = arithmetic mean. Run the algorithm many times, using many different inputs of size n that come from some distribution that generates these inputs (in the simplest case, all the possible inputs are equally likely), and compute the total running time (by adding the individual times), and divide by the number of trials. You may also need to normalize the results based on the size of the input sets.
Example (Binary search):
Suppose we have the following binary search function:
def binarySearch(arr, x, low, high):
    # repeat until the search range is empty
    while low <= high:
        mid = (low + high) // 2
        if x == arr[mid]:
            return mid
        elif x > arr[mid]:  # x is on the right side
            low = mid + 1
        else:               # x is on the left side
            high = mid - 1
    return -1               # x is not present
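For example, a quick check of the function above (the array and targets are just illustrative):
arr = [2, 5, 8, 12, 16, 23, 38]
print(binarySearch(arr, 23, 0, len(arr) - 1))  # 5  (index of 23)
print(binarySearch(arr, 7, 0, len(arr) - 1))   # -1 (not present)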
Now, let's analyze its time complexity.
Best Case Time Complexity of Binary Search
The best case of Binary Search occurs when:
The element to be searched is in the middle of the list
In this case, the element is found in the first step itself and this involves 1 comparison.
Therefore, Best Case Time Complexity of Binary Search is O(1).
Average Case Time Complexity of Binary Search
Let the input be N distinct numbers: a1, a2, ..., a(N-1), aN
We need to find element P.
There are two cases:
Case 1: The element P can be in N distinct indexes from 0 to N-1.
Case 2: There will be a case when the element P is not present in the list.
There are N instances of case 1 and 1 instance of case 2, so there are N+1 distinct cases to consider in total.
If element P is found at the K-th halving step, then Binary Search will have done K comparisons.
This is because:
The element at index N/2 can be found in 1 comparison, as Binary Search starts from the middle.
Similarly, in the 2nd comparison, elements at index N/4 and 3N/4 are compared based on the result of the 1st comparison.
Along the same lines, in the 3rd comparison, elements at index N/8, 3N/8, 5N/8, and 7N/8 are compared based on the result of the 2nd comparison.
Based on this, we know that:
Elements requiring 1 comparison: 1
Elements requiring 2 comparisons: 2
Elements requiring 3 comparisons: 4
Therefore, Elements requiring I comparisons: 2^(I-1)
The maximum number of comparisons = Number of times N is divided by 2 so that result is 1 = Comparisons to reach 1st element = logN comparisons
I can vary from 1 to logN
Total number of comparisons = 1 * (Elements requiring 1 comparison) + 2 * (Elements requiring 2 comparisons) + ... + logN * (Elements requiring logN comparisons)
Total number of comparisons = 1 * (1) + 2 * (2) + 3 * (4) + ... + logN * (2^(logN-1))
Total number of comparisons = 1 + 4 + 12 + 32 + ... = 2^logN * (logN - 1) + 1
Total number of comparisons = N * (logN - 1) + 1
Total number of cases = N+1
Therefore, average number of comparisons = ( N * (logN - 1) + 1 ) / (N+1)
Average number of comparisons = N * logN / (N+1) - N/(N+1) + 1/(N+1)
The dominant term is N * logN / (N+1), which is approximately logN. Therefore, the Average Case Time Complexity of Binary Search is O(logN).
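As a sanity check on this derivation, here is a small Python sketch (my own, not part of the original answer) that counts one comparison per halving step for every successful search and averages them; for N = 1024 it prints roughly 9, which is on the order of log2(N):
def comparisons(arr, x):
    # count how many halving steps are needed to find x
    low, high, count = 0, len(arr) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        count += 1
        if x == arr[mid]:
            return count
        elif x > arr[mid]:
            low = mid + 1
        else:
            high = mid - 1
    return count

N = 1024
arr = list(range(N))
print(sum(comparisons(arr, x) for x in arr) / N)  # ~9.0, i.e. O(log N)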
Worst Case Time Complexity of Binary Search
The worst case of Binary Search occurs when:
The element to be searched for is at the first or the last index, or is not present in the list at all.
In this case, the total number of comparisons required is logN.
Therefore, the Worst Case Time Complexity of Binary Search is O(logN).

Why is the scikit-learn confusion matrix reversed?

I have 3 questions:
1)
The confusion matrix for sklearn is as follows:
TN | FP
FN | TP
While when I'm looking at online resources, I find it like this:
TP | FP
FN | TN
Which one should I consider?
2)
Since the above confusion matrix for scikit-learn is different from the one I find in other resources, what will the structure be for a multiclass confusion matrix? I'm looking at this post here:
Scikit-learn: How to obtain True Positive, True Negative, False Positive and False Negative
In that post, #lucidv01d posted a graph to understand the categories for multiclass. Is that layout the same in scikit-learn?
3)
How do you calculate the accuracy for multiclass? For example, I have this confusion matrix:
[[27 6 0 16]
[ 5 18 0 21]
[ 1 3 6 9]
[ 0 0 0 48]]
In that same post I referred to in question 2, he has written this equation:
Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)
but isn't that just for binary classification? I mean, which class do I use in place of TP?
The reason why sklearn shows its confusion matrix like
TN | FP
FN | TP
is because in their code they have considered 0 to be the negative class and 1 to be the positive class. sklearn always considers the smaller class value to be negative and the larger one to be positive. By class value, I mean the label (0 or 1). The order depends on your dataset and classes.
The accuracy will be the sum of the diagonal elements divided by the sum of all the elements. The diagonal elements are the number of correct predictions.
As the sklearn guide says: "(Wikipedia and other references may use a different convention for axes)"
What does it mean? When building the confusion matrix, the first step is to decide where to put predictions and real values (true labels). There are two possibilities:
put predictions in the columns, and true labels in the rows
put predictions in the rows, and true labels in the columns
It is totally subjective to decide which way you want to go. From this picture, explained here, it is clear that scikit-learn's convention is to put predictions in the columns and true labels in the rows.
Thus, according to scikit-learn's convention:
the first column contains negative predictions (TN and FN)
the second column contains positive predictions (TP and FP)
the first row contains negative labels (TN and FP)
the second row contains positive labels (TP and FN)
the diagonal contains the number of correctly predicted labels.
Based on this information I think you will be able to solve part 1 and part 2 of your questions.
For part 3, you just sum the values in the diagonal and divide by the sum of all elements, which will be
(27 + 18 + 6 + 48) / (27 + 18 + 6 + 48 + 6 + 16 + 5 + 21 + 1 + 3 + 9)
or you can just use the score() function.
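For the binary case, a minimal scikit-learn sketch with made-up labels; confusion_matrix(y_true, y_pred).ravel() returns the counts in TN, FP, FN, TP order, matching the layout described above:
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]   # made-up ground truth
y_pred = [0, 1, 0, 1, 1, 0]   # made-up predictions

cm = confusion_matrix(y_true, y_pred)
print(cm)                      # rows = true labels, columns = predictions
tn, fp, fn, tp = cm.ravel()    # 2, 1, 1, 2 for this toy example
print(tn, fp, fn, tp)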
The scikit-learn convention is to place predictions in columns and real values in rows
The scikit-learn convention is to put 0 by default for a negative class (top) and 1 for a positive class (bottom). The order can be changed using labels=[1,0].
You can calculate the overall accuracy in this way
import numpy as np

M = np.array([[27, 6, 0, 16], [5, 18, 0, 21], [1, 3, 6, 9], [0, 0, 0, 48]])

# sum of the diagonal (the correct predictions)
w = M.diagonal()
w.sum()      # 99

# sum of all elements of the matrix
M.sum()      # 160

ACC = w.sum() / M.sum()
ACC          # 0.61875

Python - Negative numbers not adding, but positive numbers do

I am supposed to add up each row and the grand total of all the numbers. I can compute the grand total fine, but I am unable to add up the row that has only negative numbers. The following code adds up the positive numbers but does not add up the negative numbers correctly.
grandTotal = 0
sumRow = 0
for x in range(len(numbers)):
    sumRow = (sumRow + x)
    print(sumRow)
for x in range(len(numbers)):
    for y in range(len(numbers[x])):
        grandTotal = grandTotal + int(numbers[x][y])
print(grandTotal)
When the user input is:
1,1,-2 -1,-2,-3 1,1,1
My output is:
0
1
3
-3
instead of:
0
-6
3
-3
I know it has something to do with the first for loop, but I can't figure it out. When I try this:
grandTotal = 0
sumRow = 0
for x in range(len(numbers)):
    sumRow = (sumRow + (numbers[x]))
    print(sumRow)
for x in range(len(numbers)):
    for y in range(len(numbers[x])):
        grandTotal = grandTotal + int(numbers[x][y])
print(grandTotal)
I get the error message:
File "list.py", line 14, in
sumRow = (sumRow + (numbers[x]))
TypeError: unsupported operand type(s) for +: 'int' and 'list'
Why doesn't my code add up the negative numbers? Any help is greatly appreciated!
Where you say
sumRow = (sumRow + (numbers[x]))
you are trying to add a whole list to an integer, which is why you get the TypeError.
From my understanding, numbers is a list of lists, so
numbers[x]
will give you a whole row of numbers, not a single value. What you want is the total for every row, and the total of all rows. Here's a program that does this. I am assuming that your program already gets numbers from the user input.
grandTotal = 0
for row in numbers:
    # for each row we find the total amount
    rowTotl = 0
    for value in row:
        # for each column/value we add to the row total
        rowTotl += value
    print(rowTotl)
    # add this row's total to the grandTotal before we move on
    grandTotal += rowTotl
# after all the adding we print the grand total
print(grandTotal)
The reason your program doesn't add negative numbers is really that the row totals are not adding the numbers at all. They are just adding the indexes rather than the values, so they don't work for positive numbers either. The grand total works because you are adding all the values properly, rather than adding the indexes. FYI,
for index in range(len(numbers)):
does not give you the values, but rather the indexes 0, 1, 2, 3, 4, 5, 6... up to the end of the range. To get the values you would do
for value in numbers:
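As a quick check (assuming numbers has already been parsed into a list of lists from the sample input), the same totals can also be computed with the built-in sum():
numbers = [[1, 1, -2], [-1, -2, -3], [1, 1, 1]]   # parsed sample input

for row in numbers:
    print(sum(row))                        # 0, -6, 3  (row totals)

print(sum(sum(row) for row in numbers))    # -3 (grand total)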

Calculation of average based on a weighted score

I'm trying to calculate an average score based on a list of parameter scores (between 0 and 5). The trick is that I want to be able to weight each parameter.
E.g.:
            Parameter A   Parameter B   Parameter C
Weight      100%          70%           0%
Score       4             5             0
In the above example, the average score should be 3.75, as parameter C is left out.
I've tried with this formula: =IF.ERROR(SUM((A3*A5);(B3*B5);(C3*C5))/COUNTA(A3:C3);""). The formula seems to work if none of the parameters' weights is equal to 0. How can I adjust the formula so it excludes a score if its weight is equal to zero?
I think it should be rather easy, I just can't get it to work.
Check this :
=SUMPRODUCT( A2:A4, B2:B4 ) / SUM( B2:B4 )
Source : https://exceljet.net/formula/weighted-average
With COUNTA you are counting the non-empty cells, while you should count the non-zero cells. So, assuming that the weights are in A3:C3 and the scores in A5:C5:
=IFERROR(SUMPRODUCT(A3:C3;A5:C5)/COUNTIF(A3:C3;">0");"Error: all the weights are 0")
It would be like this:
(1*4 + 0.7*5) / 2 = 3.75
In other words, the formula is:
((WeightA/100 * ScoreA) + (WeightB/100 * ScoreB) + (WeightC/100 * ScoreC)) / (number of non-zero weights)
=SUMPRODUCT(A1:A3;B1:B3) / COUNTIF(B1:B3;"<>0") / 100
Something like this would work.
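The same zero-weight-excluded average in a short Python sketch, using the values from the example (just to illustrate the logic, not an Excel replacement):
weights = [1.0, 0.7, 0.0]   # 100%, 70%, 0%
scores  = [4, 5, 0]

weighted_sum = sum(w * s for w, s in zip(weights, scores))
nonzero      = sum(1 for w in weights if w != 0)
print(weighted_sum / nonzero)   # 3.75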

How many X cells to add to reach Y%

I'm working on this: https://tempfile.me/download/nYhdQHD65GxRzk/ .
And I need to count how many 1 cells should be added to column A to reach a percentage of 1s = 85%.
This is just an example, I can't add cells with 1 and see how many of them I need since it should be automated on a big sample of data.
Expressed as:
0.85 * (count + x) = sum + x
this rearranges to x = (85*count - 100*sum) / (100 - 85), or:
=(85*COUNT(A:A)-100*SUM(A:A))/(100-85)
However, this does not result in an integer, so to ensure 85% is reached:
=ROUNDUP((85*COUNT(A:A)-100*SUM(A:A))/(100-85),0)
The result (234) when added as 1s increases the TRUE total to 284 and the count of all entries to 334, where 284/334 is 85.03%.
It is a lot easier if you don't look at the percentage of TRUE or FALSE, but at the counts of TRUE and FALSE.
In your example, you have 50 1's and 50 0's.
You want to see how many 1's you have to add in order to get a percentage of 1's of 85%.
In numbers this would look like this:
(x+dx)/(x+dx+y) = 0.85
where x is the number of 1's, y the number of 0's and dx the increase of 1's needed, to get to 85%.
You are looking for dx. So just solve for dx and you get:
dx = (0.85*y-0.15*x)/0.15
Which yields, in your example, dx ≈ 233.3, rounded up to 234. So you need to add another 234 1's in order to get (at least) 85% of 1's.
Hope this is what you wanted. It has, however, nothing to do with Excel itself.
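A quick check of the same arithmetic in Python, using the 50 ones and 50 zeros from the example (math.ceil plays the role of Excel's ROUNDUP):
import math

ones, zeros = 50, 50                                 # counts from the example sheet
dx = math.ceil((0.85 * zeros - 0.15 * ones) / 0.15)
print(dx)                                            # 234
print((ones + dx) / (ones + zeros + dx))             # 0.8503..., i.e. at least 85%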
