IF statement for binning data - excel

I have a list of numbers in the 1st column. Based on the number in the 1st column I want to give each line another number i.e. if cell A2 has a value of bigger than 1300 and less than 1400, in B2 I want the cell to show 6.75.
If A1 has a value of 1350, then B1 will update with a value of "6.75".
If A1 has a value of 1450, then B1 will update with a value of "7.25", and so on.
There are 17 groupings that I need:
>1300 <1400 = 6.75
>1400 <1500 = 7.25
>1500 <1600 = 7.75
>1600 <1700 = 8.25
.
.
.
Bigger than 2900 =14.75
I could have numerous values on the spreadsheet in the 1st column so need to put them into a grouping bucket using some formula.
Any ideas?

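Something like =VLOOKUP(A3,bArray,2) should work, copied down as needed, where bArray is the name of a two-column list, sorted ascending, holding each breakpoint and the value that applies from that breakpoint upward. With the fourth argument omitted, VLOOKUP does an approximate match and returns the row with the largest breakpoint that is less than or equal to the lookup value.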
The break points may require slight adjustment to suit whatever is actually required.
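As a rough illustration of the same lookup-table idea outside Excel, here is a minimal Python sketch. It assumes the 17 buckets start at 1300 and step by 100 with values rising by 0.5, that each bucket includes its lower edge, and that anything below 1300 has no value (None). The names bin_value, breakpoints and values are illustrative only.

```python
from bisect import bisect_right

# Lower edge of each bucket and the value assigned to it.
# Assumption: buckets include their lower edge; below 1300 there is no value.
breakpoints = [1300 + 100 * k for k in range(17)]  # 1300, 1400, ..., 2900
values = [6.75 + 0.5 * k for k in range(17)]       # 6.75, 7.25, ..., 14.75


def bin_value(x):
    """Return the value of the last bucket whose lower edge is <= x."""
    i = bisect_right(breakpoints, x) - 1
    return values[i] if i >= 0 else None


for x in (1350, 1450, 2950):
    print(x, bin_value(x))
```

Like an approximate-match VLOOKUP, this relies on the breakpoints being sorted in ascending order.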

For a simple linear relationship like that, you could use a formula along the lines of:
=if(a1<1300, 0, if(a1>=2900, 14.75, (trunc(a1 / 100, 0) - 13) * 0.5 + 6.75))
In other words, check the too-low and too-high values first to deliver fixed results, otherwise use the final formula to convert to the desired number.
This involves dividing by 100 and truncating to turn (for example) 1727 into 17, subtracting 13 to get 4, multiplying that by 0.5 to get 2, and adding the 6.75 base to get 8.75.
That will give you what you've asked for:
x < 1300: 0.00
1300 <= x < 1400: 6.75
1400 <= x < 1500: 7.25
1500 <= x < 1600: 7.75
1600 <= x < 1700: 8.25
1700 <= x < 1800: 8.75
1800 <= x < 1900: 9.25
1900 <= x < 2000: 9.75
2000 <= x < 2100: 10.25
2100 <= x < 2200: 10.75
2200 <= x < 2300: 11.25
2300 <= x < 2400: 11.75
2400 <= x < 2500: 12.25
2500 <= x < 2600: 12.75
2600 <= x < 2700: 13.25
2700 <= x < 2800: 13.75
2800 <= x < 2900: 14.25
2900 <= x : 14.75
(Screenshot in the original answer: the formula in action on the edge cases.)
Note that you have a problem with your description of numbers like 1400 since you don't specify which range they should fall in. For the formula given above, the ranges are inclusive at the low end and exclusive at the high end (such as 1300..1399.9999).
If the relationship isn't so linear (or, more accurately, formulaic), you will probably need to consider the use of lookup tables as per the excellent suggestion by pnuts.
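For anyone who wants to check the arithmetic outside a spreadsheet, the formula can be mirrored in a few lines of Python. This is only a sketch of the same calculation: the function name bucket is made up, and the 0 returned below 1300 is this answer's own choice rather than something the question specifies.

```python
import math


def bucket(x):
    """Python mirror of the nested IF/TRUNC formula above."""
    if x < 1300:
        return 0.0       # too low: fixed result
    if x >= 2900:
        return 14.75     # too high: fixed result
    # 1727 -> 17 -> 4 -> 2.0 -> 8.75
    return (math.trunc(x / 100) - 13) * 0.5 + 6.75


for x in (1299, 1300, 1399.9999, 1400, 1727, 2899, 2900):
    print(x, bucket(x))
```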

Related

Excel Formula/Function to Subtract randomly

I'm rather new to excel and I want to use excel or perhaps another program to subtract on a fixed amount but randomly.
It is like n1 + n2 + n3 = 300
But I want n1 and n2 and n3 to be different numbers hence not division
Examples
150 + 75 + 75 = 300
or
100 + 100 + 100 = 300
or
50 + 100 + 150 = 300
A function to subtract a fixed amount but random subtraction
I'm still quite confused about how to do this in Excel; sorry for my bad English and explanation.
Please help.
If it is always 3 numbers, with the total in A1:
The first number, in B1:
=RANDBETWEEN(1,A1)
Then the second, in C1:
=RANDBETWEEN(1,A1-B1)
Then the third, in the next cell, is just the remainder:
=A1-B1-C1
In Excel your numbers sit in cells arranged in rows and columns.
I would enter 100, 100, 100 in separate columns of the same row and then use this
formula
=(A1+B1+C1).
That is, cell A1 holds 100, cell B1 holds 100 and cell C1 holds 100. The answer should be 300.
If you want to subtract, just use the minus sign instead of the plus sign:
formula =(A1-B1-C1).
Try:
=LET(A,RANDBETWEEN(0,300),B,RANDBETWEEN(0,300-A),HSTACK(A,B,300-SUM(A,B)))
Or, if no HSTACK() available:
=LET(A,RANDBETWEEN(0,300),B,RANDBETWEEN(0,300-A),CHOOSE({1,2,3},A,B,300-SUM(A,B)))
I'm just unsure whether, mathematically, this is as random as it can be. Would it be 'more random' to list all possible combinations of numbers that add up to 300 and randomly pick one solution from that list?
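On that last question: drawing the numbers one after another does not make every combination equally likely, because the first draw is uniform over the whole range no matter how many ways the remainder can then be split. One possible alternative, sketched here in Python rather than Excel, picks two distinct cut points between 1 and 299 uniformly at random, which makes every ordered triple of positive integers summing to 300 equally likely. The function name and the restriction to positive parts (no zeros) are assumptions made for illustration.

```python
import random


def random_split(total=300, parts=3, rng=random):
    """Split `total` into `parts` positive integers.

    Choosing parts-1 distinct cut points uniformly from 1..total-1 gives
    every ordered split the same probability.
    """
    cuts = sorted(rng.sample(range(1, total), parts - 1))
    bounds = [0] + cuts + [total]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


print(random_split())  # three positive integers that sum to 300
```

This avoids listing every possible combination, since each pair of cut points corresponds to exactly one split.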
One way to approach this, if you want two numbers to always add up to the same value, is to modify each value by the same amount, adding to one value and subtracting from the other:
100 = n1 + n2 = (n1+a) + (n2-a)
where a is any random number.
To extend this to three numbers, you can use two arbitrary numbers a and b as follows:
100 = n1 + n2 + n3 = (n1+a) + (n2-a+b) + (n3-b)
The simplified approach is to pick completely random n2 and n3 and let n1 pick up the difference:
100 = (100-n2-n3) + n2 + n3
To do this in Excel, use the =RANDBETWEEN() function for n2 and n3, and then for n1 just subtract from 100:
=100 - SUM(n2,n3)

Updating Pandas data frame cells by condition

I have a data frame and want to update specific cells in a column based on a condition on another column.
   ID Name Metric     Unit  Value
2   1   K2     M1  msecond      1
3   1   K2     M2      NaN     10
4   2   K2     M1  usecond    500
5   2   K2     M2      NaN      8
The condition is, if Unit string is msecond, then multiply the corresponding value in Value column by 1000 and store it in the same place. Considering a constant step for row iteration (two-by-two), the following code is not correct
i = 0
while i < len(df_group):
    x = df.iloc[i].at["Unit"]
    if x == 'msecond':
        df.iloc[i].at["Value"] = df.iloc[i].at["Value"] * 1000
    i += 2
However, the output is the same as before modifications. How can I fix that? Also what are the alternatives for better coding instead of that while loop?
A much simpler (and more efficient) form would be to use loc:
df.loc[df['Unit'] == 'msecond', 'Value'] *= 1000
If it is essential to update only every step-th index:
step = 2
start = 0
df.loc[df['Unit'].eq('msecond') & (df.index % step == start), 'Value'] *= 1000
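A likely reason the loop in the question has no effect is that df.iloc[i] hands back a separate row object, so the assignment through .at updates that temporary row rather than df itself. The sketch below rebuilds the sample frame (the column types are guessed from the printout) and applies the boolean-mask update with the factor of 1000 that the question asks for.

```python
import pandas as pd

# Sample data reconstructed from the question; dtypes are an assumption.
df = pd.DataFrame(
    {
        "ID": [1, 1, 2, 2],
        "Name": ["K2", "K2", "K2", "K2"],
        "Metric": ["M1", "M2", "M1", "M2"],
        "Unit": ["msecond", None, "usecond", None],
        "Value": [1, 10, 500, 8],
    },
    index=[2, 3, 4, 5],
)

# One indexing call selects rows and column together, so the update
# is written into df itself rather than into a temporary copy.
mask = df["Unit"] == "msecond"
df.loc[mask, "Value"] *= 1000

print(df)
```

Only the row whose Unit is msecond changes; rows with a missing Unit compare as False and are left alone.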

Excel: Rewriting Data into Buckets

I have the following values in an excel column:
11
84
167
241
520
I want to rewrite these column values as a group such that:
if cell value < 50 then A
if 50 < cell value < 100 then B
if 100 < cell value < 150 then C
if 150 < cell value < 250 then D
if cell value > 250 then E
I tried the following logic but it shows A for cell A1 and false for other values:
=IF(A1<50,"A",IF(50<A1<100,"B",IF(100<A1<150,"C",IF(150<A1<250,"D",IF(A1>250,"E")))))
We do not need the lower-bound comparison in each step, because if the value were below that bound an earlier IF would already have returned true. Note that the boundary values are not handled the way you describe: with the formula below, 50, 100 and 150 land in the next bucket up (B, C and D), and exactly 250 returns FALSE because neither A1<250 nor A1>250 is true. Less than or equal to (<=) may be what you need, but from your example I can't tell.
=IF(A1<50,"A",IF(A1<100,"B",IF(A1<150,"C",IF(A1<250,"D",IF(A1>250,"E")))))
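For future reference, a range test has to be broken up into separate comparisons inside AND. Excel evaluates 50<A1<100 from left to right as (50<A1)<100, comparing a TRUE/FALSE result with 100, so it does not test a range:
IF(AND(50<A1,A1<100),"B")
vs
IF(50<A1<100,"B")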
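To make the boundary behaviour explicit, here is a small Python sketch of the same bucketing. It assumes each boundary value belongs to the higher bucket (50 is B, 250 is E), which is one way to close the gaps the question leaves open. The names edges, labels and bucket are illustrative.

```python
from bisect import bisect_right

edges = [50, 100, 150, 250]   # upper limits of A, B, C and D
labels = "ABCDE"


def bucket(value):
    """Label a value; a value equal to an edge goes to the higher bucket."""
    return labels[bisect_right(edges, value)]


print([bucket(v) for v in (11, 84, 167, 241, 520)])  # ['A', 'B', 'D', 'D', 'E']
```

Swapping bisect_right for bisect_left would put each boundary value in the lower bucket instead.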

Excel split given number into sum of other numbers

I'm trying to write formulae that will split a given number into the sum of 4 other numbers.
The other numbers are 100,150,170 and 200 so the formula would be
x = a*100+b*150+c*170+d*200 where x is the given number and a,b,c,d are integers.
My spreadsheet is set up as where col B are x values, and C,D,E,F are a,b,c,d respectively (see below).
B | C | D | E | F |
100 1 0 0 0
150 0 1 0 0
200 0 0 0 1
250 1 1 0 0
370 0 0 1 1
400 0 0 0 2
I need formulae for columns C,D,E,F (which are a,b,c,d in the formula)
Your help is greatly appreciated.
UPDATE:
Based on the research below, for input numbers greater than 730 and/or for all actually divisible input numbers use the following formulas:
100s: =CHOOSE(MOD(ROUNDUP([#number]/10;0); 20)+1;
0;1;1;0;1;1;0;1;0;0;1;0;0;1;0;0;1;0;1;1)
150s: =CHOOSE(MOD(ROUNDUP([#number]/10;0); 10)+1;
0;0;1;1;0;1;1;0;0;1)
170s: =CHOOSE(MOD(ROUNDUP([#number]/10;0); 5)+1;
0;3;1;4;2)
200s: =CEILING(([#number]-930)/200;1) +
CHOOSE(MOD(ROUNDUP([#number]/10;0); 20)+1;
4;1;2;0;2;3;1;3;1;2;4;2;3;0;2;3;0;3;0;1)
MOD(x; 20) returns a number from 0 to 19, and CHOOSE(x;a;b;...) returns the x-th of the values that follow the index (1 => a, 2 => b, ...), which is why 1 is added to the MOD result.
See the Excel documentation on CHOOSE for more information.
Use , instead of ; as the argument separator, depending on your Windows language & region settings.
let's start with the assumption that you want to preferably use 200s over 170s over 150s over 100s - i.e. 300=200+100 instead of 300=2*150 and follow the logical conclusions:
the result set can only contain at most 1 100, at most 1 150, at most 4 170s and unlimited number of 200s (i started with 9 170s because 1700=8x200+100, but in reality there were at most 4)
there are only 20 possible subsets of (100s, 150s, 170s) - 2*2*5 options
930 is the largest input number without any 200s in the result set
based on observation of the data points, the subset repeats periodically for
number = 740*k + 10*l, k>1, l>0 - i'm not an expert on reverse-guessing on periodic functions from data, but here is my work in progress (charted data points are from the table at the bottom of this answer)
the functions are probably more complicated, if i manage to get them right, i'll update the answer
anyway for numbers smaller than 740, more tweaking of the formulas or a lookup table are needed (e.g. there is no way to get 730, so the result should be the same as for 740)
Here is my solution based on lookup formulas:
Following is the python script i used to generate the data points, formulas from the picture and the 60-row table itself in csv format (sorted as needed by the match function):
headers = ("100s", "150s", "170s", "200s")
table = {}
for c200 in range(30, -1, -1):
    for c170 in range(9, -1, -1):
        for c150 in range(1, -1, -1):
            for c100 in range(1, -1, -1):
                nr = 200*c200 + 170*c170 + 150*c150 + 100*c100
                if nr not in table and nr <= 6000:
                    table[nr] = (c100, c150, c170, c200)
print("number\t" + "\t".join(headers))
for r in sorted(table):
    c100, c150, c170, c200 = table[r]
    print("{:6}\t{:2}\t{:2}\t{:2}\t{:2}".format(r, c100, c150, c170, c200))
__________
=IF(E$1<740; 0; INT((E$1-740)/200))
=E$1 - E$2*200
=MATCH(E$3; table[number]; -1)
=INDEX(table[number]; E$4)
=INDEX(table[100s]; E$4)
=INDEX(table[150s]; E$4)
=INDEX(table[170s]; E$4)
=INDEX(table[200s]; E$4) + E$2
__________
number,100s,150s,170s,200s
940,0,0,2,3
930,1,1,4,0
920,0,1,1,3
910,0,0,3,2
900,1,0,0,4
890,0,1,2,2
880,0,0,4,1
870,1,0,1,3
860,0,1,3,1
850,1,1,0,3
840,1,0,2,2
830,0,1,4,0
820,1,1,1,2
810,1,0,3,1
800,0,0,0,4
790,1,1,2,1
780,1,0,4,0
770,0,0,1,3
760,1,1,3,0
750,0,1,0,3
740,0,0,2,2
720,0,1,1,2
710,0,0,3,1
700,1,0,0,3
690,0,1,2,1
680,0,0,4,0
670,1,0,1,2
660,0,1,3,0
650,1,1,0,2
640,1,0,2,1
620,1,1,1,1
610,1,0,3,0
600,0,0,0,3
590,1,1,2,0
570,0,0,1,2
550,0,1,0,2
540,0,0,2,1
520,0,1,1,1
510,0,0,3,0
500,1,0,0,2
490,0,1,2,0
470,1,0,1,1
450,1,1,0,1
440,1,0,2,0
420,1,1,1,0
400,0,0,0,2
370,0,0,1,1
350,0,1,0,1
340,0,0,2,0
320,0,1,1,0
300,1,0,0,1
270,1,0,1,0
250,1,1,0,0
200,0,0,0,1
170,0,0,1,0
150,0,1,0,0
100,1,0,0,0
0,0,0,0,0
Assuming that you want as many of the highest values as possible (so 500 would be 2*200 + 100) try this approach assuming the number to split in B2 down:
Insert a header row with the 4 numbers, e.g. 100, 150, 170 and 200 in the range C1:F1
Now in F2 use this formula:
=INT(B2/F$1)
and in C2 copied across to E2
=INT(($B2-SUMPRODUCT(D$1:$G$1,D2:$G2))/C$1)
Now you can copy the formulas in C2:F2 down all columns
That should give the results from your table
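Read as a largest-first (greedy) split, which is my interpretation of the formulas above, the approach can be sketched in Python to check it against the rows in the question. The function name and the returned remainder are additions for illustration.

```python
def greedy_split(x, denominations=(200, 170, 150, 100)):
    """Take as many of each denomination as fit, largest first.

    Returns (counts, remainder); a non-zero remainder means the greedy
    pass did not reach x exactly.
    """
    counts = {}
    remainder = x
    for d in denominations:
        counts[d], remainder = divmod(remainder, d)
    return counts, remainder


for x in (100, 150, 200, 250, 370, 400):
    print(x, greedy_split(x))
```

With these denominations the greedy pass reproduces the rows for 100, 150, 200, 370 and 400, but for 250 it takes one 200 and leaves 50 over instead of finding 100 + 150, so it is worth checking the remainder before relying on the result.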

Minimum number of training examples for Find-S/Candidate Elimination algorithms?

Consider the instance space consisting of integer points in the x, y plane, where 0 ≤ x, y ≤ 10, and the set of hypotheses consisting of rectangles (i.e. being of the form (a ≤ x ≤ b, c ≤ y ≤ d), where 0 ≤ a, b, c, d ≤ 10).
What is the smallest number of training examples one needs to provide so that the Find-S algorithm perfectly learns a particular target concept (e.g. (2 ≤ x ≤ 4, 6 ≤ y ≤ 9))?
When can we say that the target concept is exactly learned in the case of the Find-S algorithm, and what is the optimal query strategy?
I'd also like to know the answer w.r.t Candidate Elimination.
Thanks in advance.
You need two positive examples: (2,6)
(2 <= x <= 2, 6 <= y <= 6)
and then (4,9)
(2 <= x <= 4, 6 <= y <= 9)
That is the S set done and this is the end of the answer to teaching/learning with FIND-S
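The Find-S step described here can be sketched in a few lines of Python, under the assumption that a hypothesis is stored as a tuple (a, b, c, d) meaning a <= x <= b, c <= y <= d. The function name find_s and the tuple layout are illustrative choices.

```python
def find_s(positive_examples):
    """Return the most specific rectangle covering every positive example."""
    hypothesis = None  # most specific hypothesis: covers no point at all
    for x, y in positive_examples:
        if hypothesis is None:
            # The first positive example gives a single-point rectangle.
            hypothesis = (x, x, y, y)
        else:
            # Generalize just enough to include the new point.
            a, b, c, d = hypothesis
            hypothesis = (min(a, x), max(b, x), min(c, y), max(d, y))
    return hypothesis


print(find_s([(2, 6)]))           # (2, 2, 6, 6)
print(find_s([(2, 6), (4, 9)]))   # (2, 4, 6, 9)
```

Any two opposite corners of the target rectangle work equally well, for example (2, 9) and (4, 6).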
With Candidate elimination, we need to give negative examples to build the G set.
We need four negative examples to define the four boundaries of the rectangle:
G starts as (-Inf <= x <= Inf, -Inf <= y <= Inf)
Add (3,5)- and we get hypothesis:
(-Inf <= x <= Inf, 6 <= y <= Inf)
Add (3,10)-
(-Inf <= x <= Inf, 6 <= y <= 9)
Add (1,7)-
(2 <= x <= Inf, 6 <= y <= 9)
Add (5,7)-
(2 <= x <= 4, 6 <= y <= 9)
So now S=G={(2 <= x <= 4, 6 <= y <= 9)}. As S=G, it has perfectly learned the concept.
I have seen this question in different formats. Replace -Inf with 0 and Inf with 10 if it specifies the problem domain as such.
This is the optimal order to feed in the training examples. The worst order is to do the G set first, as you will create four different candidate hypotheses, which will merge to three with the second example and then merge to one with the 3rd example. It is useful to illustrate C-E with a tree as in the Mitchell book, and perhaps sketch the hypothesis graph next to each.
This answer is confirmed here:
http://ssdi.di.fct.unl.pt/scl/docs/exercises/Clemens%20Dubslaff%20hm4.pdf
Assuming all ranges are a ≤ x ≤ b and a and b are integer then...
In a 1 dimensional case (only x) there would be 4 samples, (a-1,a,b,b+1) that would prove it.
If you extend that to 2 dimensions (x and y) it should be 16 samples, which are those above as x, and (c-1,c,d,d+1) for y, with all possible combinations.
Please correct me if I don't understand the problem.

Resources