Using a subset of Excel entries for calculation

Assume that you have 5 cells with values:
[12, 23, 50, 89, 95]
and you are interested in finding the average of the four largest entries (that is, drop 12 because it is the smallest).
How can one do that in Excel?

You can get the average of the largest 4 of 5 values with this formula:
=AVERAGE(LARGE(A1:E1,{1,2,3,4}))
That will average exactly 4 values even if there are duplicates.
More generally, if you have a variable number of values, you can average everything except the smallest value with this version:
=(SUM(Range)-MIN(Range))/(COUNT(Range)-1)
Again, that works with duplicates. Of course, there must be at least 2 numbers in the range.
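As a rough cross-check of the two formulas above, here is a minimal Python sketch. Python is used only for illustration, the function names are made up for this example, and it assumes the range holds plain numbers.

```python
def average_of_largest(values, k):
    """Mirror of =AVERAGE(LARGE(range,{1,...,k})): mean of the k largest values."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def average_without_smallest(values):
    """Mirror of =(SUM(Range)-MIN(Range))/(COUNT(Range)-1)."""
    if len(values) < 2:
        raise ValueError("need at least 2 numbers")
    # Only one copy of the minimum is subtracted, so duplicates are handled.
    return (sum(values) - min(values)) / (len(values) - 1)


data = [12, 23, 50, 89, 95]
print(average_of_largest(data, 4))     # 64.25
print(average_without_smallest(data))  # 64.25
```

Both give the same result here because dropping the single smallest of five values leaves the four largest.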

You can use AVERAGEIF(range, criteria).
In your case that would be =AVERAGEIF(A1:E1,">"&MIN(A1:E1))
Hope this helps.
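One thing worth checking with the AVERAGEIF approach is what happens when the smallest value appears more than once. The sketch below (Python for illustration only, with a made-up function name) mirrors the strict ">" comparison: every copy of the minimum is excluded, not just one.

```python
def averageif_above_min(values):
    """Mirror of =AVERAGEIF(range,">"&MIN(range)): keep values strictly above the minimum."""
    smallest = min(values)
    kept = [v for v in values if v > smallest]
    if not kept:
        # Excel's AVERAGEIF shows #DIV/0! when no cell meets the criteria.
        raise ZeroDivisionError("no value is greater than the minimum")
    return sum(kept) / len(kept)


print(averageif_above_min([12, 23, 50, 89, 95]))  # 64.25
print(averageif_above_min([12, 12, 50, 89, 95]))  # 78.0, both 12s are excluded
```

So this version matches "drop the smallest entry" only when the minimum is unique.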

Use LARGE to get the n-th largest value, then use SUMIF to add up every value that is at least as large as that n-th value.
Something like this: =SUMIF(range,">="&LARGE(range,n))/n (sorry, it's been a while since I used Excel).
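The SUMIF/LARGE idea can be sketched in Python as follows (illustration only; the function name is invented here). It also shows a caveat: if several values tie with the n-th largest, the ">=" test lets more than n values into the sum, so dividing by n no longer gives the average of the top n.

```python
def sumif_at_least_nth_largest(values, n):
    """Sum every value that is >= the n-th largest one."""
    nth = sorted(values, reverse=True)[n - 1]   # LARGE(range, n)
    return sum(v for v in values if v >= nth)   # SUMIF(range, ">=" & nth)


data = [12, 23, 50, 89, 95]
print(sumif_at_least_nth_largest(data, 4) / 4)  # 64.25

tied = [12, 23, 23, 89, 95]
print(sumif_at_least_nth_largest(tied, 3))      # 230, four values were summed (top 3 would be 207)
```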

Related

Counting if part of string is within interval

I am currently trying to check if a number in a comma-separated string is within a number interval. What I am trying to do is to check if an area code (from the comma-separated string) is within the interval of an area.
The data:
AREAS
Area interval | Name | Number of locations
1000-1499 | Area 1 | ?
1500-1799 | Area 2 | ?
1800-1999 | Area 3 | ?
GEOLOCATIONS
Name | Areas List
Location A | 1200, 1400
Location B | 1020, 1720
Location C | 1700, 1920
Location D | 1940, 1950, 1730
The result I want is the number of unique locations whose "Areas List" contains at least one code within the area interval. So Location D should count only ONCE in the 1800-1999 interval, and Location A likewise only once in the 1000-1499 interval. But Location B should count once in both 1000-1499 and 1500-1799 (because its comma-separated "Areas List" contains a number from each interval):
Area interval | Name | Number of locations
1000-1499 | Area 1 | 2
1500-1799 | Area 2 | 3
1800-1999 | Area 3 | 2
How is this possible?
I have tried with COUNTIFS, but it doesn't seem to do the job.
Here is one option using FILTERXML():
Formula in C2:
=SUM(FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>","//t[count(.//*[.>="&SUBSTITUTE(A2,"-","][.<=")&"])>0]"))
Where:
"<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>" - This part constructs a valid piece of XML with three levels of nodes. Each t-node starts with a literal 1 so that the t-nodes returned by the XPath can be summed. The outer x-node is there to make sure Excel handles the inner nodes correctly. If you are curious what the resulting XML looks like, it's best to step through the formula using 'Evaluate Formula' on the Formulas tab;
"//t[count(.//*[.>="&SUBSTITUTE(A2,"-","][.<=")&"])>0]" - This means that we collect all t-nodes where the count of child s-nodes that are >= the leftmost number and <= the rightmost number is larger than zero. For A2 the XPath would look like //t[count(.//*[.>=1000][.<=1499])>0] after substitution. In short: //t selects t-nodes, .//* selects all of their child nodes, and count(...)>0 keeps a t-node only if at least one child fulfills both requirements [.>=1000][.<=1499];
Since every t-node equals the number 1, the SUM() of the returned t-nodes equals the number of unique locations that have at least one area code from the interval in their Areas List;
Note that FILTERXML() returns an error if no t-nodes are found. To handle that, wrap the FILTERXML() in IFERROR(..., 0) so that the SUM() still works correctly.
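To see what the constructed XML looks like and how the counting works, here is a rough Python sketch using the sample data from the question. Python stands in for Excel purely as an illustration: build_xml and count_locations are names invented here, and the interval test is done in plain Python rather than with the XPath expression, because the standard-library parser does not support count() or numeric comparisons in its path syntax.

```python
import xml.etree.ElementTree as ET

areas_lists = ["1200, 1400", "1020, 1720", "1700, 1920", "1940, 1950, 1730"]


def build_xml(cells):
    # Same string building as the TEXTJOIN/SUBSTITUTE part of the formula:
    # each cell becomes <t>1<s>code</s><s>code</s></t>, all wrapped in <x>.
    parts = ["1<s>" + cell.replace(", ", "</s><s>") for cell in cells]
    return "<x><t>" + "</s></t><t>".join(parts) + "</s></t></x>"


def count_locations(cells, interval):
    low, high = (int(part) for part in interval.split("-"))
    root = ET.fromstring(build_xml(cells))
    # Keep each t-node that has at least one s-node inside the interval,
    # then sum the literal 1 stored at the start of every kept t-node.
    return sum(
        int(t.text)
        for t in root.findall("t")
        if any(low <= int(s.text) <= high for s in t.findall("s"))
    )


print(build_xml(areas_lists[:1]))  # <x><t>1<s>1200</s><s>1400</s></t></x>
for interval in ("1000-1499", "1500-1799", "1800-1999"):
    print(interval, count_locations(areas_lists, interval))  # 2, 3, 2
```

Because a location contributes at most one t-node, it is counted once per interval no matter how many of its codes fall inside.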
Or, wrap the above in BYROW():
Formula in C2:
=BYROW(A2:A4,LAMBDA(a,SUM(FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>","//t[count(.//*[.>="&SUBSTITUTE(a,"-","][.<=")&"])>0]"))))
Using MMULT and TEXTSPLIT:
=LET(rng,TEXTSPLIT(D2,"-"),
tarr,IFERROR(--TRIM(TEXTSPLIT(TEXTJOIN(";",,$B$2:$B$5),",",";")),0),
SUM(--(MMULT((tarr>=--TAKE(rng,,1))*(tarr<=--TAKE(rng,,-1)),SEQUENCE(COLUMNS(tarr),,1,0))>0)))
I am in very distinguished company but will add my version anyway, as BYROW is probably a slightly different approach:
=LET(range,B$2:B$5,
lowerLimit,--INDEX(TEXTSPLIT(E2,"-"),1),
upperLimit,--INDEX(TEXTSPLIT(E2,"-"),2),
counts,BYROW(range,LAMBDA(r,SUM((--TEXTSPLIT(r,",")>=lowerLimit)*(--TEXTSPLIT(r,",")<=upperLimit)))),
SUM(--(counts>0))
)
Here is the ugly way to do it, with A LOT of helper columns. But it is not so complicated 🙂
(The formulas use ; as the argument separator, as some regional settings require.)
F4: =TRANSPOSE(FILTERXML("<m><r>"&SUBSTITUTE(B4;",";"</r><r>")&"</r></m>";"//r"))
F11: =TRANSPOSE(FILTERXML("<m><r>"&SUBSTITUTE(A11;"-";"</r><r>")&"</r></m>";"//r"))
F16: =SUM(F18:F21)
F18: =IF(SUM(($F4:$O4>=$F$11)*($F4:$O4<=$G$11))>0;1;"")
G18: =IF(SUM(($F4:$O4>=$F$12)*($F4:$O4<=$G$12))>0;1;"")
H18: =IF(SUM(($F4:$O4>=$F$13)*($F4:$O4<=$G$13))>0;1;"")

How can I express easily a formula that has a lot of nesting Ifs

I want to express a formula that says if a number in a column is 50 to 99, then return 50. If 100-149, then return 100, 150-199, then return 150, etc, etc. I need a more concise way to do that for numbers that could reach 2000 (in 50 increments).
Right now my formula is =if(and >50 <100),50,if >100,100,true,0) or something like that; I can't see it right now.
There's probably a faster way, but here's what I would do:
Create a new column that rounds down to the nearest 50:
Assume the numbers are in Column A:
=CONCAT(FLOOR(A2,50),"-",IF(FLOOR(A2,100)-1<FLOOR(A2,50),FLOOR(A2,100)+99,FLOOR(A2,100)-1))
This produces, for every row, the number rounded down to the nearest 50, a hyphen, and the last number of the enclosing hundred (for example 50-99 or 150-199). It also keeps working up to 10,000, 50,000 or 100,000 without any change to the formula.
The only thing missing is another nested IF for numbers below 50, but that's up to you. Without it, the result is 0-99 for any number under 50 and 50-99 for any number from 50 to 99.
Edit
I found out, after all that work, that you just wanted it rounded down to the nearest 50. Just use =FLOOR(A2, 50)
Divide the number by 50, then multiply the integer of that by 50:
=INT(A1/50)*50
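Or subtract half the increment (25) and use MROUND:
=MROUND(A1-25,50)
(For values below 25 the first argument becomes negative, and MROUND returns #NUM! when the number and the multiple have different signs.)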
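The rounding-down logic behind =FLOOR(A2,50) and =INT(A1/50)*50 can be sketched in Python like this (illustration only; floor_to_step is a name made up for this example, and non-negative whole numbers are assumed):

```python
def floor_to_step(value, step=50):
    """Round value down to the nearest multiple of step, like INT(value/step)*step."""
    return (value // step) * step


for n in (49, 50, 99, 100, 149, 1999):
    print(n, floor_to_step(n))
# 49 0
# 50 50
# 99 50
# 100 100
# 149 100
# 1999 1950
```

Integer division does the work of the whole nested IF chain, so the same expression covers any upper limit.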

Get sum of 3 large values out of 4 in MS Excel

I need to calculate the best 3 marks of the class test assignments and get their sum in Excel. I am wondering what the formula for this would be. For reference I have added an image below:
For example, from (17, 1, 19, 20) I want to pick (17, 19, 20) and put their sum (17+19+20 = 56) in the final marks column.
Since you have only 4 items, it can be calculated like this
=SUM(B3:E3) - MIN(B3:E3)
You can use the LARGE() function to get the largest, second largest, third largest values and sum them all up.
=SUM(LARGE(B3:F3,1),LARGE(B3:F3,2),LARGE(B3:F3,3))
I use this to get best 4 out of 6:
=SUM(LARGE(N4:S4,{1,2,3,4}))
which for your three becomes:
=SUM(LARGE(B3:F3,{1,2,3}))
I don't enter it as an array formula; it works just fine when dragged down as necessary.
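For comparison, the "sum of the k largest" idea behind =SUM(LARGE(range,{1,2,3})) looks like this in a short Python sketch (illustration only; sum_of_best is an invented name):

```python
import heapq


def sum_of_best(marks, k=3):
    """Sum of the k largest marks, like =SUM(LARGE(range,{1,...,k}))."""
    return sum(heapq.nlargest(k, marks))


print(sum_of_best([17, 1, 19, 20]))  # 56
```

Unlike =SUM(range)-MIN(range), this does not depend on there being exactly one more item than the number of marks to keep.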

Obtain every nth row of filtered records

I'm looking for information on how to copy every nth row of records from one Excel sheet to the next, and now I am wondering if there is a way to do this for filtered data. For example, I have 400 students enrolled at school, and I want every 15th male whose parents have not graduated from college (flags have been created for both gender and parent education, which I am using to filter on). Are there any ideas on how to do this? If not, I could just use the OFFSET function for each combination of variables I am filtering on, but that's over 30-40 combinations if I did my math right. Thanks for any help you can provide.
There are a few standard formulas used for retrieving the first, second, third, etc set of values that match criteria. I prefer a standard formula model using the INDEX function and SMALL function. By throwing a little maths at the increment to change it from 1, 2, 3 ... to 1, 16, 31, 46, ... you should be able to achieve your offset results. In the following example image, I've used a stagger of 4 rather than 15 in order to accommodate sample data vertically while still producing more than a single result.
        
The formula in F2 is,
=IFERROR(INDEX(A$2:A$999, SMALL(INDEX(ROW($1:$998)+((C$2:C$999<>"M")+(D$2:D$999<>"N"))*1E+99, , ), 1+(ROW(1:1)-1)*4)), "")
For your purposes the 4 in 1+(ROW(1:1)-1)*4 will need to be changed to 15.
=IFERROR(INDEX(A$2:A$999, SMALL(INDEX(ROW($1:$998)+((C$2:C$999<>"M")+(D$2:D$999<>"N"))*1E+99, , ), 1+(ROW(1:1)-1)*15)), "")
Fill down as necessary.
Once you have retrieved a unique identifier, the remainder can be retrieved with a simple VLOOKUP function.
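The logic of the formula (keep only the matching rows, then take the 1st, 5th, 9th, ... match) can be sketched in Python as below. This is only an illustration: the student records are made-up sample data, every_nth_match is an invented name, and a stagger of 4 is used as in the example above.

```python
# Hypothetical sample: 40 students with alternating gender and a parent-education flag.
students = [
    {"id": i, "gender": "M" if i % 2 else "F", "parent_grad": "N" if i % 3 else "Y"}
    for i in range(1, 41)
]


def every_nth_match(records, predicate, step):
    """Filter first, then keep matches number 1, 1+step, 1+2*step, ..."""
    matches = [r for r in records if predicate(r)]
    return matches[::step]


picked = every_nth_match(
    students,
    lambda r: r["gender"] == "M" and r["parent_grad"] == "N",
    step=4,
)
print([r["id"] for r in picked])  # [1, 13, 25, 37]
```

Changing step to 15 corresponds to changing the 4 in 1+(ROW(1:1)-1)*4 to 15.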

Binning in Excel

Which formulae in MS Excel can we use for -
equi-depth binning
equi-width binning
Here's what I used. The data I was binning was in A2:A2001.
Equi-width:
I calculated the width in a separate cell (U2), using this formula:
=(MAX($A$2:$A$2001) - MIN($A$2:$A$2001) + 0.00000001)/10
10 is the number of bins. The + 0.00000001 is there because without it, values equal to the maximum were getting put into their own bin.
Then, for the actual binning, I used this:
=ROUNDDOWN(($A2-MIN($A$2:$A$2001))/$U$2, 0)
This function is finding how many bin-widths above the minimum your value is, by dividing (value - minimum) by the bin width. We only care about how many full bin-widths fit into the value, not fractional ones, so we use ROUNDDOWN to chop off all the fractional bin-widths (that is, show 0 decimal places).
Equi-depth:
This one is simpler.
=ROUNDDOWN(PERCENTRANK($A$2:$A$2001, $A2)*10, 0)
First, get the percentile rank of the current cell ($A2) out of all the cells being binned ($A$2:$A$2001). This will be a value between 0 and 1, so to convert it into bins, just multiply by the total number of bins you want (I used 10). Then, chop off the decimals the same way as before.
For either of these, if you want your bins to start at 1 rather than 0, just add a +1 to the end of the formula.
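Both binning formulas can be approximated in a few lines of Python. This is a sketch under stated assumptions, not a replica of Excel: the function names are invented, the percent rank is a simplified version (values strictly below, divided by n - 1, assuming no ties and ignoring Excel's default of three significant digits), and integer arithmetic is used for the equi-depth bins to avoid floating-point surprises.

```python
def equi_width_bins(values, n_bins=10, eps=1e-8):
    """Bin index = number of full bin widths above the minimum (0-based)."""
    lo, hi = min(values), max(values)
    width = (hi - lo + eps) / n_bins  # eps keeps the maximum out of an extra bin
    return [int((v - lo) / width) for v in values]


def equi_depth_bins(values, n_bins=10):
    """Bin index = floor(percent_rank * n_bins), with a simplified percent rank."""
    n = len(values)
    return [
        sum(1 for other in values if other < v) * n_bins // (n - 1)
        for v in values
    ]


data = [0, 5, 12, 37, 55, 100]
print(equi_width_bins(data))  # [0, 0, 1, 3, 5, 9]
print(equi_depth_bins(data))  # [0, 2, 4, 6, 8, 10]
```

In this sketch the maximum has a percent rank of exactly 1, so in the equi-depth version it lands in bin index 10 rather than 9. That is the same edge case the small epsilon works around in the equi-width formula.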
Best approach is to use the built-in method:
http://support.microsoft.com/kb/214269
I think the VBA version of the add-in (step 3 with most versions) will also give you the code.
Put this formula in B1:
=MAX( ROUNDUP( PERCENTRANK($A$1:$A$8, A1) *4, 0),1)
Fill the formula down column B and you are done. The formula divides the range into 4 equal buckets and returns the number of the bucket that the cell A1 falls into. The first bucket contains the lowest 25% of values.
General pattern is:
=MAX( ROUNDUP ( PERCENTRANK ([Range], [TestCell]) * [NumberOfBuckets], 0), 1)
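A small Python sketch of this general pattern, for illustration only: bucket_numbers is an invented name, the percent rank is simplified to "values strictly below, divided by n - 1" (no ties assumed), and the rounding up is done with integers.

```python
def bucket_numbers(values, n_buckets=4):
    """1-based bucket per value: max(ceil(percent_rank * n_buckets), 1)."""
    n = len(values)
    result = []
    for v in values:
        below = sum(1 for other in values if other < v)
        # Integer ceiling of below * n_buckets / (n - 1)
        rounded_up = -(-(below * n_buckets) // (n - 1))
        result.append(max(rounded_up, 1))  # MAX(..., 1) lifts the minimum into bucket 1
    return result


print(bucket_numbers([10, 20, 30, 40, 50, 60, 70, 80]))  # [1, 1, 2, 2, 3, 3, 4, 4]
```

With eight distinct values and four buckets, each bucket ends up holding two values.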
You may have to build the matrix to graph.
For the bin brackets you could use =PERCENTILE() for equi-depth, and a proportion of the difference =MAX(Data) - MIN(Data) for equi-width.
You could obtain the frequency with =COUNTIF(). The bin's mean could be obtained using =SUMPRODUCT((Data>LOWER_BRACKET)*(Data<UPPER_BRACKET)*Data)/frequency
More complex statistics can be reached by hacking around with SUMPRODUCT and/or array formulas (which I do not recommend, since they are very hard for a non-programmer to comprehend).
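The frequency and mean of a single bin, as described with COUNTIF and SUMPRODUCT, could be sketched in Python like this (illustration only; the function name and sample numbers are made up, and the brackets are treated as exclusive, matching the > and < in the SUMPRODUCT formula):

```python
def bin_frequency_and_mean(data, lower, upper):
    """Count and average the values strictly between lower and upper."""
    inside = [x for x in data if lower < x < upper]
    frequency = len(inside)
    mean = sum(inside) / frequency if frequency else None  # avoid dividing by zero
    return frequency, mean


print(bin_frequency_and_mean([1, 4, 6, 8, 12], 3, 10))  # (3, 6.0)
```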

Resources