Excel Array formula to count moving average outliers - excel

I've tried a few things on this and settled on a 'cheap' solution. Wanted to know if this can be done directly and more elegantly.
Problem Statement and Sample Data
Assume we have a table in excel with ~200 columns and a large number of rows (~10k).
Sample Data:
identifier val1 val2 val3 ... val200
ID_1 100 102 34 ... 89
We want to add a column at the end that shows how many "moving average" outliers exist in each row. A moving average outlier is defined as a point that is outside the range (mean - 2 * std deviations, mean + 2 * std deviations), where the mean and std dev are calculated using the previous 10 values (hence a moving average outlier).
We will not test the first 10 values. But from val11, the previous 10 values will be used to form the window and we want to test if the value is an outlier.
My Solution so far
I created another table of same dimensions as the original. In cells from val11 (to val200, for all columns), I put in the formula below in the new table. And then, I can simply sum the columns in each row in the new table.
Assume val11 is on X2 in the "shocks" worksheet (for first row):
=IF(OR(shocks!X2<AVERAGEA(shocks!D2:W2)-2*STDEVA(shocks!D2:W2),shocks!X2>AVERAGEA(shocks!D2:W2)+2*STDEVA(shocks!D2:W2)),1,"")
But if possible, I want to avoid having a second table since it bloats and slows down the file. Any help would be greatly appreciated.
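To make the definition above concrete, here is a minimal Python sketch of the per-row count (not an Excel formula). The function name `count_moving_outliers` and the sample row are made up for illustration, and it assumes every cell is numeric, so the sample standard deviation plays the role of STDEVA:

```python
from statistics import mean, stdev


def count_moving_outliers(values, window=10, k=2.0):
    """Count values outside mean +/- k*stdev of the preceding `window` values."""
    count = 0
    # The first `window` values are never tested, as in the question.
    for i in range(window, len(values)):
        previous = values[i - window:i]
        m = mean(previous)
        s = stdev(previous)  # sample standard deviation (n - 1 denominator)
        if values[i] < m - k * s or values[i] > m + k * s:
            count += 1
    return count


# Hypothetical row: ten steady values, one spike, then a normal value.
row = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100, 150, 101]
print(count_moving_outliers(row))
```

Note that the spike itself enters the next window and widens the band, which is why the value after it is not flagged.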

Related

Why is my SUMX DAX function returning this result?

Suppose I have 2 tables:
fTransactions
ProdID RepID Revenue
1 1 10
1 1 10
1 2 10
dSalesReps
RepID RepName
1 joe
2 sue
With dSalesReps having the following measures with no filters applied yet:
RepSales:=CALCULATE(SUM(fTransactions[Revenue]))
RepSales2:=SUMX(fTransactions, CALCULATE(SUM(fTransactions[Revenue])))
The first measure performs how I think it would. It goes to the fTransactions table and sums up the Revenue column.
The second measure, after a lot of trial and error to figure it out, seems to sort of group itself on unique rows in fTransactions. In the above example, fTransactions has 2 rows where everything is identical, then a last row where something is different. This seems to result in the following:
(10 + 10) first iteration that sums the first "grouping"
+
(10 + 10) second iteration that sums the first "grouping" again
+
(10) last iteration that sums the second "grouping"
= 20 + 20 + 10 = 50
At least that's how it looks to be operating. I just don't understand why. I thought it would go to the fTransactions table, sum all of Revenue for each iteration, then sum those sums as a final step.
This is caused by something called "context transition" (see the SQLBI article for a more detailed explanation).
In practice, your formula "RepSales2" uses a row context (created by SUMX) which is turned into an equivalent filter context (by CALCULATE). Since the table has no unique key, that filter can match multiple rows in a single iteration, as explained below.
For the first row, the row context is ProdID=1 AND RepID=1 AND Revenue=10. The equivalent filter context has the same conditions, but a filter context applies to the whole table, and two rows (the first two) match this filter.
This is repeated for each row.
It does not happen with the formula "RepSales" because that formula does not iterate over the rows (as you already noticed).
To prove this, add a RowID column to the transaction table. The duplication then disappears because the equivalent filter context also includes the RowID column, which matches only one row.
Hope this helps, use the sqlbi article as a reference, it will be an exhaustive guide to understand this
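The behaviour described in this answer can be imitated outside DAX. The following Python sketch is only a model of context transition under the assumption that the filter is "every column equals the current row's value"; the function name is invented for the illustration:

```python
def sumx_with_context_transition(rows, value_col):
    """Imitate SUMX(table, CALCULATE(SUM(table[value_col])))."""
    total = 0
    for current in rows:  # row context created by the iterator
        # Context transition: the current row becomes a filter on ALL columns,
        # so every identical row matches, not just the current one.
        matching = [r for r in rows if r == current]
        total += sum(r[value_col] for r in matching)
    return total


f_transactions = [
    {"ProdID": 1, "RepID": 1, "Revenue": 10},
    {"ProdID": 1, "RepID": 1, "Revenue": 10},
    {"ProdID": 1, "RepID": 2, "Revenue": 10},
]
print(sumx_with_context_transition(f_transactions, "Revenue"))  # 50

# Adding a unique RowID makes every row distinct, so each filter matches one row.
with_row_id = [dict(r, RowID=i) for i, r in enumerate(f_transactions, start=1)]
print(sumx_with_context_transition(with_row_id, "Revenue"))  # 30
```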

Why does the DAX formula in my calculated column use propagation to filter in one instance and not in another?

Suppose I have a couple of tables:
fTransactions
Index ProdID RepID Revenue
1 1 1 10
2 1 1 20
3 2 2 30
4 2 2 10
dSalesReps
RepID RepName CC1 CC2
1 joe 30 70
2 sue 40 70
3 bob (blank) 70
CC1 contains a calculated column with:
CALCULATE(SUM(fTransactions[Revenue]))
It's my understanding that it's taking the row context and changing to filter context to filter the fTransaction table down to the RepID and summing. Makes sense per an sqlbi article on the subject:
"because the filter context containing the current product is automatically propagated to sales due to the relationship between the two tables"
CC2 contains a calculated column with:
SUMX(fTransactions, CALCULATE(SUM(fTransactions[Revenue])))
However, this one puts the same value in every row and doesn't seem to propagate the RepID like the other example. The same sqlbi article mentions that a filter is made on the entire fTransactions row. My question is why does it do that here and not in the other example, and what happened to the propagation of RepID?
"CALCULATE places a filter on all the columns of the table to identify a single row, not on its row number"
A calculated column is created in a loop: power pivot goes row by row and calculates the results. CALCULATE converts each row into a filter context (context transition).
In the second formula, however, you have 2 loops, not one:
First, it loops dSalesReps table (because that's where you are creating the column);
Second, it loops fTransactions table, because you are using SUMX function, which is an iterator.
CALCULATE function is used only in the second loop, forcing context transition for each row in fTransactions table. But there is no CALCULATE that can force context transition for the rows in the dSalesReps. Hence, there is no filtering by Sale Reps.
Fixing the problem is easy: just wrap the second formula in CALCULATE. Better yet, drop the inner CALCULATE and sum the column directly - the inner context transition is not necessary and makes the formula slow:
CC2 =
CALCULATE(
    SUMX(fTransactions, fTransactions[Revenue])
)
This formula is essentially identical to the first one (behind the scenes the first formula is translated into this one; the SUM function is just syntactic sugar for SUMX).
You could also write the formula as:
CC2 = SUMX( RELATEDTABLE( fTransactions ), fTransactions[Revenue] )
or
CC2 = SUMX( CALCULATETABLE( fTransactions ), fTransactions[Revenue] )
The key is that fTransactions as the first argument of SUMX needs to be filtered for each SalesRep (i.e. on the current row). Without the filter then you are just iterating the entire fTransactions table for each SalesRep. Somehow SUMX needs to know you just want the fTransactions for the SalesRep whose revenue you are trying to compute.
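As a rough model of the two loops discussed above, the Python sketch below imitates both calculated columns. It is an illustration only: the helper names `cc1` and `cc2_unfiltered` are invented, the relationship is modelled as a plain RepID match, and a DAX blank is represented by 0:

```python
f_transactions = [
    {"Index": 1, "ProdID": 1, "RepID": 1, "Revenue": 10},
    {"Index": 2, "ProdID": 1, "RepID": 1, "Revenue": 20},
    {"Index": 3, "ProdID": 2, "RepID": 2, "Revenue": 30},
    {"Index": 4, "ProdID": 2, "RepID": 2, "Revenue": 10},
]
rep_ids = [1, 2, 3]


def cc1(rep_id):
    # CALCULATE turns the current dSalesReps row into a filter, which then
    # reaches fTransactions through the RepID relationship.
    return sum(t["Revenue"] for t in f_transactions if t["RepID"] == rep_id)


def cc2_unfiltered(rep_id):
    # No outer CALCULATE: rep_id is never applied as a filter, so the
    # iterator walks the whole table for every sales rep.
    total = 0
    for current in f_transactions:
        # Inner CALCULATE: filter on all columns of the current transaction row.
        total += sum(t["Revenue"] for t in f_transactions if t == current)
    return total


for rep_id in rep_ids:
    print(rep_id, cc1(rep_id), cc2_unfiltered(rep_id))
```

Every rep gets the same second value because `rep_id` is ignored inside `cc2_unfiltered`, which mirrors the missing outer context transition.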

Sum of Averages in Excel Pivot Table

I am measuring room utilization (time used/time available) from a data dump. Each row contains the available time for the day and the time used for a particular case.
The image is a simplified version of the data.
If you read the yellow and green highlights (Room 1):
In room 1, there are 200 available minutes on 1/1/2016.
Case 1 took 60 minutes, case 2 took 50 minutes.
There are 500 available minutes on 1/2/2016, and only one case occurred that day, using 350 minutes.
Room 1 utilization = (60 + 50 + 350)/(200 + 500)
The problem with summing the available time is that it double counts the 200 minutes for 1/1/2016, giving: Utilization = (60+50+350)/(200+200+500)
There are hundreds of rows in this data (and there will be multiple data dumps of differing #'s of rows) with multiple cases occurring each day.
I am trying to use a pivot table, but I cannot obtain the 'sum of averages' for a particular room (see image). I am using a macro to pull the numbers out of the grand total column.
Is this possible? Do you see another way to obtain utilization?
(note: there are lots of other columns in the data, like case start, case end, day of week, etc, that are not used in this calculation but are available)
The reason that you're getting 300 for both Average of Available Time columns is because the grand total is a grand total based on the overall average and not a sum of the averages.
Room 1: (200 + 200 + 500) / 3 = 300
Room 2: (300 + 300 + 300) / 3 = 300
I could not comment on the original question, so my solution is based on a few assumptions.
Assumption #1: The data will always be grouped, e.g. all cases in room 1 on a given day will be grouped in sequential rows.
Assumption #2: The available time column is a single value for the whole day; there will never be differing available times on the same day.
Solution: Use column E as the Actual Available Time. This column uses a formula to check whether the current row has the same combination (Date + Room + Available Time) as the previous row. If it does, the cell contains 0; otherwise it contains that row's available time.
Formula to use in E2:
=IF(AND($A1 = $A2, $B1 = $B2, $C1 = $C2), 0, $C2)
Extend the formula as far down as necessary and then include the new column in your PivotTable data range.
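The same idea, counting each day's available time only once per room, can be sketched in Python. This is an illustration of the arithmetic rather than a pivot table technique; the tuple layout and the function name `utilization` are assumptions, and the rows are the Room 1 figures from the question:

```python
# (room, date, available_minutes, case_minutes), one row per case
rows = [
    ("Room 1", "1/1/2016", 200, 60),
    ("Room 1", "1/1/2016", 200, 50),
    ("Room 1", "1/2/2016", 500, 350),
]


def utilization(rows, room):
    """Minutes used divided by available minutes, counting each day once."""
    used = 0
    available_by_day = {}
    for row_room, date, available, case_minutes in rows:
        if row_room != room:
            continue
        used += case_minutes
        # Keyed by date, so repeated rows for the same day do not double count.
        available_by_day[date] = available
    return used / sum(available_by_day.values())


print(utilization(rows, "Room 1"))  # (60 + 50 + 350) / (200 + 500)
```

Unlike the row-comparison formula, this version does not require the rows to be sorted, but it still relies on Assumption #2.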
I created a unique reference by combining columns and then used sumif/countif/countif.
So the formula in column E would be:
=sumif(colB,cellB,ColC)/Countif(colB,cellE)/Countif(colB,cellE)
Doesn't matter if the data is in order or not then.
Extend the formula as far down as necessary and then include the new column in your PivotTable data range.
The easiest method I would recommend is this.
=SUM(H:H)-GETPIVOTDATA("Average of Available Time",$G$3)
The first term sums the H column, and the second term subtracts the grand total value. It is a dynamic solution, and will change to fit the size of the pivot table.
My assumption is that the Pivot Table was originally placed in cell G3.

Excel IF OR Statement

I am having trouble determining the correct way to calculate a final rank order for four categories. Each of the four metrics make up a higher group. A Top 10 of each category is applied to the respective product to risk analysis.
CURRENT LOGIC - Assignment of 25% max per category.
Columns - Y4
Parts
0.25
25
=IF(L9=1,$Y$4,IF(L9=2,$Y$4*0.9, IF(L9=3,$Y$4*0.8, IF(L9=4,$Y$4*0.7, IF(L9=5,$Y$4*0.6, IF(L9=6,$Y$4*0.5, IF(L9=7,$Y$4*0.4, IF(L9=8,$Y$4*0.3, IF(L9=9,$Y$4*0.2, IF(L9=10,$Y$4*0.1,0))))))))))
DESIRED...
I would like to use a statement to determine three criteria in order to apply a score (1=100, 2=90, 3=80, etc..).
SUM the rank positions of each of the four categories-apply product rank ascending (not including NULL since it's not in the Top 10)
IF a product is identified in more than one metric-apply a significant contribution weight of (*.75),
IF a product has the number 1 rank in any of the four metrics-apply a score of (100).
Data - UPDATED EXAMPLE
(Product) Parts Labor Overhead External Final Score
"XYZ" 3 1 7 7 100
"ABC" NULL 6 NULL 2 100
"LMN" 4 NULL NULL NULL 70
This is way beyond my capability. ANY assistance is appreciated greatly!!!
Jim
I figured this is a good start and I can alter the weight as needed to reflect the reality of the situation.
=AVERAGE(G28:I28)+SUM(G28:I28)*0.25
However, I couldn't figure out how to put a cap on the score of no more than 100 points.
I am still unclear of what exactly you are attempting and if this will work, but how about this simple matrix using an array formula and some conditional formatting.
Array Formula in F2 (make sure to press Ctrl+Shift+Enter when exiting formula edit mode)
=MIN(100,SUM(IF(B2:E2<>"NULL",CHOOSE(B2:E2,100,90,80,70,60,50,40,30,20,10))))
Conditional Formatting defined as shown below.
Red = 100 value where it comes from a 1
Yellow = 100 value where it comes from more than 1 factor, but without a 1.
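For readers who want to check the array formula's logic by hand, here is a small Python sketch of the same scoring rule: rank 1 maps to 100 points, rank 10 to 10 points, NULL contributes nothing, and the sum is capped at 100. The function name `final_score` is made up, and NULL is represented as `None`:

```python
def final_score(ranks, cap=100):
    """Sum the points for each Top 10 rank and cap the result."""
    points = {rank: 110 - 10 * rank for rank in range(1, 11)}  # 1 -> 100 ... 10 -> 10
    total = sum(points[r] for r in ranks if r is not None)
    return min(cap, total)


# Rows from the updated example: Parts, Labor, Overhead, External
print(final_score([3, 1, 7, 7]))           # "XYZ"
print(final_score([None, 6, None, 2]))     # "ABC"
print(final_score([4, None, None, None]))  # "LMN"
```

The three results agree with the Final Score column in the question's example (100, 100, 70).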

Find the top n values in a range while keeping the sum of values in another range under x value

I'd like to accomplish the following task. There are three columns of data. Column A represents price, where the sum needs to be kept under $100,000. Column B represents a value. Column C represents a name tied to columns A & B.
Out of >100 rows of data, I need to find the highest 8 values in column B while keeping the sum of the prices in column A under $100,000. And then return the 8 names from column C.
Can this be accomplished?
EDIT:
I attempted the Solver solution w/ no luck. 200 rows looks to be the max w/ Solver, and that is what I'm using now. Here are the steps I've taken:
Create a column called rank RANK(B2,$B$2:$B$200) (used column D -- what is the purpose of this?)
Create a column called flag just put in zeroes (used column E)
Create 3 total cells total_price (=SUM(A2:A200)), total_value (=SUM(B2:B200)) and total_flag (=SUM(E2:E200))
Use solver to minimize total_value (shouldn't this be maximize??)
Add constraints
-Total_price<=100000
-Total_flag=8
-Flag cells are binary
Using Simplex LP, it simply changes the flags for the first 8 values. However, the total price for the first 8 values is >$100,000 ($140k). I've tried changing some options in the Solver Parameters as well as using different solving methods to no avail. I'd like to post an image of the parameter settings, but don't have enough "reputation".
EDIT #2:
The first 5 rows look like this; price goes down to ~$6k at the bottom of the table.
Price Value Name Rank Flag
$22,538 42.81905675 Blow, Joe 1 0
$22,427 37.36240932 Doe, Jane 2 0
$17,158 34.12127693 Hall, Cliff 3 0
$16,625 33.97654031 Povich, John 4 0
$15,631 33.58212402 Cow, Holy 5 0
I'll give you the Solver solution as a starting point. It involves the creation of some extra columns and total cells. Note that Solver is limited in the number of cells it can handle, but it will work with 100 anyway.
Create a column called rank RANK(B2,$B$2:$B$100)
Create a column called flag just put in zeroes
Create 3 total cells total_price, total_value and total_flag
Use solver to minimize total_value
Add constraints
-Total_price<=100000
-Total_flag=8
-Flag cells are binary
This will flag the rows you want and you can grab the names however you want.
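To show what the flags are meant to achieve, here is a brute-force Python sketch: pick exactly k rows with the highest total value whose total price stays within the budget. It is not the Solver model itself. It uses the five sample rows from the question with a hypothetical k of 3 and budget of 60,000, and exhaustive search like this is only practical for small inputs (8 out of 100 rows is far too many combinations). It also assumes that only the selected rows count toward the price and value totals:

```python
from itertools import combinations


def best_selection(items, k, budget):
    """Return names of the k items with maximum total value within budget."""
    best = None
    for combo in combinations(items, k):
        price = sum(p for p, _, _ in combo)
        if price > budget:
            continue  # violates the total price constraint
        value = sum(v for _, v, _ in combo)
        if best is None or value > best[0]:
            best = (value, combo)
    return [] if best is None else [name for _, _, name in best[1]]


# (price, value, name) taken from the first five rows in the question
items = [
    (22538, 42.81905675, "Blow, Joe"),
    (22427, 37.36240932, "Doe, Jane"),
    (17158, 34.12127693, "Hall, Cliff"),
    (16625, 33.97654031, "Povich, John"),
    (15631, 33.58212402, "Cow, Holy"),
]
print(best_selection(items, k=3, budget=60000))
```

With these numbers the three highest values together cost too much, so the result skips "Doe, Jane" in favour of cheaper rows, which is the trade-off the flags and constraints are supposed to capture.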
