What does the value list mean in a Decision Tree graph - decision-tree

While viewing the question "scikit learn - feature importance calculation in decision trees", I had trouble understanding the value list of the Decision Tree. For example, the top node has value=[1,3]. What exactly are 1 and 3? Does it mean that if X[2] <= 0.5, then 1 false, 3 true? If so, the value list is [number of false cases, number of true cases]. And if so, what about the value lists of the leaves?
Why do three right leaves have [0,1] and one left leaf has [1,0]?
What does [1,0] or [0,1] mean anyway? One false and zero true, or zero false and one true? But there's no condition on the leaves (like something <= 0.5). Then what is true and what is false?
Your advice is highly appreciated!

value=[1,3] means that, at this exact node of the tree (before applying the filter X[2] <= 0.5), you have:
1 sample of class 0
3 samples of class 1
As you go down the tree, you are filtering. The objective is to end up with perfectly separated classes, so you tend to get something like value=[0,1], which means that after applying all filters you have 0 samples of class 0 and 1 sample of class 1.
You can also check that, in this tree, the sum of value equals samples at every node. This makes complete sense, since value only tells you how the samples that arrived at this node are distributed across the classes.
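To make the counting concrete, here is a minimal sketch in plain Python that reproduces the per-node class counts by hand. The toy data and the helper name class_counts are assumptions made for illustration; this is not scikit-learn's internal code, and the tree in the question has more splits than this single one.

```python
from collections import Counter

def class_counts(labels, n_classes=2):
    """Return [count of class 0, count of class 1, ...] for the samples at a node."""
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(n_classes)]

# Hypothetical toy data: four samples, one feature (standing in for X[2]) and a binary label.
feature = [0, 1, 1, 1]
labels = [0, 1, 1, 1]

# At the root, nothing has been filtered yet.
root_value = class_counts(labels)

# Apply the split "feature <= 0.5" and count again on each side.
left = [y for x, y in zip(feature, labels) if x <= 0.5]
right = [y for x, y in zip(feature, labels) if x > 0.5]

print(root_value)           # [1, 3]
print(class_counts(left))   # [1, 0]
print(class_counts(right))  # [0, 3]
```

The positions in the list are class labels, not true/false outcomes of the split, which is why a leaf without any condition still has a value list.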


Creating a dynamic array with given probabilities in Excel

I want to create a dynamic array that returns me X values based on given probabilities. For instance:
Imagine this is a gift box and you can open the box N times. What I want is to have N random results. For example, I want to get randomly 5 of these two rarities but based on their chances.
I have the following formula for now:
=index(A2:A3,randarray(5,1,1,rows(A2:A3),1))
And this is the output I get:
The problem here is that I have a dynamic array with the 5 results BUT NOT BASED ON THE PROBABILITIES.
How can I add probabilities to the array?
Here is how you could generate a random outcome with defined probabilities for the entries (Google Sheets solution, not sure about Excel):
=ARRAYFORMULA(
  VLOOKUP(
    RANDARRAY(H1, 1),
    {
      {0; OFFSET(C2:C,,, COUNTA(C2:C) - 1)},
      OFFSET(A2:A,,, COUNTA(C2:C))
    },
    2
  )
)
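The lookup idea behind the formula above (draw a uniform random number, then find which cumulative-probability interval it falls into) can be sketched in Python as follows. The function name weighted_draws and the 80/20 split are assumptions for illustration; only the "Normal" and "Rare" labels come from the thread.

```python
import bisect
import itertools
import random

def weighted_draws(items, probabilities, n, rng=random):
    """Draw n items independently, each with its given probability."""
    # Upper edge of each item's interval, e.g. [0.8, 1.0] for 80% / 20%.
    cumulative = list(itertools.accumulate(probabilities))
    draws = []
    for _ in range(n):
        u = rng.random() * cumulative[-1]          # uniform in [0, total)
        idx = bisect.bisect_right(cumulative, u)   # first interval whose edge exceeds u
        draws.append(items[min(idx, len(items) - 1)])  # guard against rounding at the top edge
    return draws

# Hypothetical example: open the box 5 times.
print(weighted_draws(["Normal", "Rare"], [0.8, 0.2], 5))
```

Each draw here is independent of the previous ones, which matches the "open the box N times" reading of the question.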
This whole subject of random selection was treated very thoroughly in Donald Knuth's series of books, The Art of Computer Programming, Volume 2, "Seminumerical Algorithms". In that book he presents an algorithm for selecting exactly X out of N items in a list using pseudo-random numbers. What you may not have considered is that after you have chosen your first item the probability array has changed to (X-1)/(N-1) if your first outcome was "Normal" or X/(N-1) if your first outcome was "Rare". This means you'll want to keep track of some running totals based on your prior outcomes to ensure your probabilities are dynamically updated with each pick. You can do this with formulas, but I'm not certain how the back-reference will perform inside an array formula. Microsoft's dynamic array documentation indicates that such internal array references are considered "circular" and are prohibited.
In any case, trying to extend this to 3+ outcomes is very problematic. In order to implement that algorithm with 3 choices (X + Y + Z = N picks) you would need to break this up into one random number for an X or not X choice and then a second random number for a Y or not Y choice. This becomes a recursive algorithm, beyond Excel's ability to cope in formulas.
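For the two-outcome case, the running-total idea described above can be sketched in Python. This is a simplified illustration of drawing without replacement, not a transcription of Knuth's algorithm; the function name and the box contents (2 rare items out of 5) are assumptions.

```python
import random

def draw_without_replacement(rare_left, total_left, picks, rng=random):
    """Pick items one at a time, updating the odds after every pick.

    Assumes picks <= total_left.
    """
    outcomes = []
    for _ in range(picks):
        # The chance of "Rare" depends on what is still in the box.
        if rng.random() < rare_left / total_left:
            outcomes.append("Rare")
            rare_left -= 1
        else:
            outcomes.append("Normal")
        total_left -= 1
    return outcomes

# Hypothetical box: 2 rare items among 5; emptying it always yields exactly 2 "Rare".
print(draw_without_replacement(2, 5, 5))
```

Whether the picks should affect each other like this depends on whether the box is finite; if every opening is an independent roll, the probabilities stay fixed.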

A composite result based on three variables

I've searched the site and I think I have a unique excel question. I'm new to using advanced functions in excel to analyze data. I have a question about how to design a formula to give a composite result between three variables. So the scenarios I'm working with are:
If A=Positive, B =Positive, C= Positive then the overall composite result to be positive.
If A=Negative, B = Positive, C= Positive, the overall composite result to be negative.
If A = Negative, B =Negative, and C=Negative then the overall composite result to be negative.
If A = Positive, B =Negative, C= Positive, then the overall composite result to be positive.
Any help will be greatly appreciated.
Are you looking for a formula something like this one? ...
=IF(AND(A1>0,B1>0,C1>0),"Positive",IF(AND(A1<0,B1<0,C1<0),"Negative",IF(OR(A1<0,B1<0),"Negative","ELSE condition when none of the earlier Condition is met")))
From Digital Logic 101, you need 8 cases for an exhaustive list of all the possible combinations. And that is if you ignore the case 0 and suppose it is not present in your data. In that case, you seem to be looking for something like this:
=SIGN(A1*B1*C1) * something
or
=IF(A1*B1*C1 >=0, "Positive", "negative")
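The point about needing 8 cases can be checked by enumerating them. The sketch below, in Python, lists every combination of three Positive/Negative inputs and shows which ones the question leaves unspecified; the variable names are assumptions, and the four stated results are copied from the question.

```python
from itertools import product

SIGNS = ("Positive", "Negative")

# The four scenarios given in the question: (A, B, C) -> composite result.
stated = {
    ("Positive", "Positive", "Positive"): "Positive",
    ("Negative", "Positive", "Positive"): "Negative",
    ("Negative", "Negative", "Negative"): "Negative",
    ("Positive", "Negative", "Positive"): "Positive",
}

all_cases = list(product(SIGNS, repeat=3))          # 2 * 2 * 2 = 8 combinations
unspecified = [case for case in all_cases if case not in stated]

print(len(all_cases))    # 8
for case in unspecified:
    print(case)          # the 4 combinations the question does not cover

# In the four stated scenarios the result happens to equal A.
print(all(result == a for (a, _b, _c), result in stated.items()))  # True
```

Four scenarios are not enough to pin down a single rule, so any formula should be tested against the remaining four combinations once their intended results are known.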

Logical operator on named range in a sumproduct, weighted histogram (excel)

So I wrote some code a few years ago that generates a spreadsheet that does some neat stuff. It involves doing some weird stuff for a histogram. I just came across a need to update/use this for a different project, and there is some code in there I really don't entirely understand.
Basically it's a logical operator working on a named range that then gets used in a sumproduct. And for the life of me I don't entirely get why it works, but it does. Here's the offending line.
=SUMPRODUCT((tblC62>G90)*(tblC62<=G91) * tblC2wgt2)
tblC62 and tblC2wgt2 reference some paired data. Each record in tblC62 lines up with a record from tblC2wgt2. The purpose is to create a weighted histogram. tblC2wgt2 provides the weight. The data is binned by tblC62 values and the bin range is defined by G90 and G91 and tblC62. So it will define.
And this works. I've checked it thoroughly. And I don't understand why. It's the tblC62 logicals being multiplied that is the most confusing.
Anyways, I have to explain the math to my boss shortly.....so if anyone can explain to me how this code works I would appreciate it.
It is using the inherent values of the Booleans TRUE/FALSE. When they are used in math they are automatically converted to 1 and 0 respectively.
So in:
=SUMPRODUCT((tblC62>G90)*(tblC62<=G91) * tblC2wgt2)
When (tblC62>G90) is true its value is 1, and when it is false its value is 0. The same goes for (tblC62<=G91).
So when both are true we get 1 * 1, which equals 1. If either is false we get 0 * 1 (or 0 * 0), which equals 0.
That result is then multiplied by tblC2wgt2. So when either or both are false we get 0 * tblC2wgt2 = 0, and when both are true we get 1 * tblC2wgt2 = tblC2wgt2.
SUMPRODUCT then adds up these per-row results, which gives the total weight of the records that fall in the bin.
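The same row-by-row arithmetic can be sketched in Python, where booleans also behave as 1 and 0 when multiplied. The function name and the sample values and weights are assumptions for illustration; lower and upper play the roles of G90 and G91.

```python
def weighted_bin_total(values, weights, lower, upper):
    """Sum the weights of the values v with lower < v <= upper."""
    total = 0.0
    for v, w in zip(values, weights):
        in_bin = (v > lower) * (v <= upper)  # True * True == 1, anything else == 0
        total += in_bin * w                  # adds w when in the bin, 0 otherwise
    return total

# Hypothetical paired data: each value lines up with one weight.
values = [1.0, 2.5, 3.0, 4.2]
weights = [10.0, 20.0, 30.0, 40.0]

print(weighted_bin_total(values, weights, 2.0, 3.0))  # 50.0
```

Only 2.5 and 3.0 fall in the bin (2.0, 3.0], so the total is 20.0 + 30.0.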

Binary search - worst/avg case

I'm finding it difficult to understand why/how the worst and average case for searching for a key in an array/list using binary search is O(log(n)).
log(1,000,000) is only 6. log(1,000,000,000) is only 9 - I get that, but I don't understand the explanation. If one did not test it, how do we know that the avg/worst case is actually log(n)?
I hope you guys understand what I'm trying to say. If not, please let me know and I'll try to explain it differently.
Worst case
Every time the binary search code makes a decision, it eliminates half of the remaining elements from consideration. So you're dividing the number of elements by 2 with each decision.
How many times can you divide by 2 before you are down to only a single element? If n is the starting number of elements and x is the number of times you divide by 2, we can write this as:
n / (2 * 2 * 2 * ... * 2) = 1 [the '2' is repeated x times]
or, equivalently,
n / 2^x = 1
or, equivalently,
n = 2^x
So log base 2 of n gives you x, which is the number of decisions being made.
Finally, you might ask, if I used log base 2, why is it also OK to write it as log base 10, as you have done? The base does not matter because the difference is only a constant factor which is "ignored" by Big O notation.
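As a quick check of the derivation above, this Python sketch counts how many times n can be halved before a single element is left and compares the count with log base 2 of n. The function name is an assumption for illustration.

```python
import math

def halvings_until_one(n):
    """Count how many times n can be halved (rounding down) before reaching 1."""
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps

for n in (8, 1_000_000, 1_000_000_000):
    # The loop count matches floor(log2(n)).
    print(n, halvings_until_one(n), math.floor(math.log2(n)))
```

For 1,000,000 this gives 19 and for 1,000,000,000 it gives 29, roughly 3.32 times the base-10 values of 6 and 9 quoted in the question. That ratio, 1 / log10(2), is the constant factor that Big O notation ignores.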
Average case
I see that you also asked about the average case. Consider:
There is only one element in the array that can be found on the first try.
There are only two elements that can be found on the second try. (Because after the first try, we chose either the right half or the left half.)
There are only four elements that can be found on the third try.
You can see the pattern: 1, 2, 4, 8, ... , n/2. To express the same pattern going in the other direction:
Half the elements take the maximum number of decisions to find.
A quarter of the elements take one fewer decision to find.
etc.
Since half of the elements take the maximum amount of time, it doesn't matter how much less time the other elements take. We could assume that all elements take the maximum amount of time, and even if half of them actually take 0 time, our assumption would not be more than double whatever the true average is. We can ignore "double" since it is a constant factor. So the average case is the same as the worst case, as far as Big O notation is concerned.
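The 1, 2, 4, 8 pattern can be observed directly by counting how many probes a binary search needs for each element. The sketch below does so in Python for a 15-element sorted list; the function name and the choice of 15 elements (2^4 - 1, so every level is full) are assumptions for illustration.

```python
from collections import Counter

def probes_to_find(sorted_values, key):
    """Return how many middle elements are examined before key is found."""
    lo, hi = 0, len(sorted_values) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_values[mid] == key:
            return probes
        if key > sorted_values[mid]:
            lo = mid + 1
        else:
            hi = mid - 1
    return probes  # key absent: number of probes before giving up

values = list(range(15))
counts = [probes_to_find(values, k) for k in values]

print(sorted(Counter(counts).items()))  # [(1, 1), (2, 2), (3, 4), (4, 8)]
print(max(counts))                      # 4
print(sum(counts) / len(counts))        # 49/15, about 3.27
```

The average of about 3.27 probes is well within a factor of two of the worst case of 4, which is the argument made above.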
For binary search, the array should be arranged in ascending or descending order.
In each step, the algorithm compares the search key value with the key value of the middle element of the array.
If the keys match, then a matching element has been found and its index, or position, is returned.
Otherwise, if the search key is less than the middle element's key, then the algorithm repeats its action on the sub-array to the left of the middle element.
Or, if the search key is greater, then the algorithm repeats its action on the sub-array to the right.
If the remaining array to be searched is empty, then the key cannot be found in the array and a special "not found" indication is returned.
So, a binary search is a dichotomic divide and conquer search algorithm. It therefore takes logarithmic time to perform the search operation, as the number of elements is reduced by half in each iteration.
For sorted lists, on which we can do a binary search, each "decision" compares your key to the middle element. If the key is greater, the search continues in the right half of the list; if it is less, it continues in the left half; if it matches, the element at that position is returned. Each decision therefore cuts the remaining list in half, which yields O(log n).
Binary search, however, only works for sorted lists. For unsorted lists you can do a straight linear search starting with the first element, which yields a complexity of O(n).
O(log n) < O(n)
That said, the best approach depends on how many searches you'll be doing, your inputs, and so on.
For Binary search the prerequisite is a sorted array as input.
• As the list is sorted:
• Certainly we don't have to check every word in the dictionary to look up a word.
• A basic strategy is to repeatedly halve our search range until we find the value.
• For example, look for 5 in the list of 9 numbers below:
v = 1 1 3 5 8 10 18 33 42
• We would first start in the middle: 8
• Since 5<8, we know we can look at just the first half: 1 1 3 5
• Looking at the middle # again, narrow down to 3 5
• Then we stop when we're down to one #: 5
Number of comparisons needed: 4 = floor(log(base 2)(9)) + 1 = O(log(base 2) n)
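#include <vector>

// Returns the index of val in the sorted (ascending) vector v, or -1 if absent.
int binary_search(const std::vector<int>& v, int val) {
    int from = 0;
    int to = static_cast<int>(v.size()) - 1;
    while (from <= to) {
        int mid = from + (to - from) / 2;  // avoids overflow of from + to
        if (val == v[mid])
            return mid;                    // found
        else if (val > v[mid])
            from = mid + 1;                // continue in the right half
        else
            to = mid - 1;                  // continue in the left half
    }
    return -1;                             // not found
}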

Extend value to arithmetic mean

Might be a quite stupid question and I'm not sure if it belongs here or to math.
My problem:
I have several elements of type X which have a boolean attribute Y.
To calculate the percentage of elements where Y is true, I count all X where Y is true and divide it by the number of elements.
But I don't want to iterate over all elements every time just to update that percentage value.
My idea was:
If I had 33% for 3 elements, and am adding a fourth one where Y is true:
(0.33 * 3 + 1) / 4 = 0.4975
Obviously that does not work well because of the 0.33.
Is there any way for getting an accurate solution without iteration or saving the number of items where Y is true?
Keep a count of the total number of elements and of the "true" ones. Global vars, object member variables, whatever. I assume that sometime back when the program is starting, you have zero elements. Every time an element is added, removed, or its boolean attribute changes, increment or decrement those counts as appropriate. You'll never have to iterate over the list (except maybe for testing) but at the cost of every change to the list having to include fiddling with those variables.
Your idea doesn't work because 0.33 does not equal 1/3. It's an approximation. If you take the exact value, you get the right answer:
(1/3 * 3 + 1) / 4 = (1 + 1) / 4 = 1/2
My question is, if you can store the value of 33% without iterating, why not just store the values of 1 and 3 and calculate from them? That is, just keep a running total of the number of true values and the number of objects. Increment when you get new ones. Calculate on demand. It's not necessary to iterate every time this way.
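A minimal sketch of the running-count approach suggested in both answers, written in Python. The class name TrueRatio and its methods are assumptions for illustration; the idea is only that two integers are stored and the ratio is computed on demand.

```python
class TrueRatio:
    """Track how many elements exist and how many of them have Y == True."""

    def __init__(self):
        self.total = 0
        self.true_count = 0

    def add(self, flag):
        # Call whenever an element is added.
        self.total += 1
        self.true_count += 1 if flag else 0

    def remove(self, flag):
        # Call whenever an element is removed.
        self.total -= 1
        self.true_count -= 1 if flag else 0

    def ratio(self):
        # Computed from exact integer counts, so no rounding error accumulates.
        return self.true_count / self.total if self.total else 0.0


tracker = TrueRatio()
for flag in (True, False, False):   # 1 of 3 is true
    tracker.add(flag)
tracker.add(True)                   # a fourth element where Y is true
print(tracker.ratio())              # 0.5
```

Because the counts are integers, the result is exactly 2 / 4 = 0.5 rather than the 0.4975 obtained from the rounded 0.33.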
