Sum column based on conditions for subsums - excel

So I have a table which basically looks as follows:
Criterion Value
1 -5
1 1
2 5
2 5
3 2
3 -1
I want to sum the values in column B based on the criteria in column A, but only if the sum for an individual criterion is not negative. So for example, if I ask for the sum of all values where the criterion is between 1 and 3, the result should be 11 (the values for criterion 1 are not included in the sum because they add up to a negative number).
My first idea was to add a third column with a sumif([criterion];[#criterion];[value]) and then use a sumifs function which checks whether that third column is negative. However, my table has 100k+ rows and with that many sumif functions it becomes intolerably slow.
I know I could create a pivot table to the same effect, but that has two drawbacks: I would have to create a separate sheet, which would add complexity, and my table is frequently updated which means I would have to manually update that pivot table every time to allow for downstream calculations. NBD and I could do that as a last resort, but I wonder whether there isn't a more elegant way to solve this problem.
I would want to avoid VBA to avoid complexity (the sheet will be used by other persons).
Thank you

This can be easily done using UNIQUE() and the two versions of SUMIF() in this way:
First collect all the criteria with =UNIQUE(A2:A7) -- Assuming your data are in columns A and B starting from row 2, this goes in cell C2, with "Criteria" in C1
Compute the subtotals for all criteria using =SUMIF($A$2:$A$7, C2, $B$2:$B$7) -- This goes in cell D2 and extends as the criteria do, "Partials" in cell D1
Sum all the subtotals from step 2 that are positive with =SUMIF(D2:D7, ">0") in cell E2
If you have a lot of data, I suggest using whole-column references to avoid fixed ranges and the need to adjust the formulas as the amount of data changes:
The first formula becomes =UNIQUE(A:A) -- Don't worry about the heading being picked up (strings and empty cells are not summed)
For the second formula use =SUMIF(A:A, C2, B:B)
Use =SUMIF(D:D, ">0") for the last step
This should be reasonably fast, using just as many extra cells as the number of distinct criteria (multiplied by 2).
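To check the logic outside Excel, here is a minimal Python sketch of the same three steps (distinct criteria, subtotal per criterion, sum of the positive subtotals). It uses the six sample rows from the question; the variable names are only for illustration.

```python
from collections import defaultdict

# (criterion, value) pairs from the question's sample table
rows = [(1, -5), (1, 1), (2, 5), (2, 5), (3, 2), (3, -1)]

# Steps 1 and 2: one subtotal per distinct criterion (UNIQUE + SUMIF)
subtotals = defaultdict(int)
for criterion, value in rows:
    subtotals[criterion] += value

# Step 3: add up only the positive subtotals (SUMIF with ">0")
total = sum(s for s in subtotals.values() if s > 0)

print(dict(subtotals))  # {1: -4, 2: 10, 3: 1}
print(total)            # 11
```

Criterion 1 adds up to -4, so it is left out, and 10 + 1 gives the expected 11.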

Related

Increment by 2 for n rows, increment by 4 once and repeat when referencing data from one sheet to another

Thank you for taking the time to look at this question.
I'm looking for an equation that can easily take the numerical values from Sheet 1 (the first picture) which has 2 blank cells in between values for four values and then has 4 blank cells and then the other four values. I'm not sure if I am making sense but hopefully the picture I have attached helps.
Notice 2 blank rows between first 4 rows with values (Rows 2-11) and same between rows 16 and 25.
Also notice the 4 blank rows between the two sets of values.
For me, this is repeated for 700 values, same set up of 2 blank rows for 4 sets of values and then 4 blank rows and then four sets of values with 2 blank rows. I'm sure there is an easier way to do this.
I'm trying to recreate Sheet 2 from Sheet 1 using an equation. Is this possible?
Apologies in advance, English isn't my first language.
If the numbers are going to start in B2 and the intervals and offset staggers are static then,
=INDEX(B:B, 2+(ROW(1:1)-1)*3+INT((ROW(1:1)-1)/4)*2)
If the first number is in S6 then,
=INDEX(S:S, 6+(ROW(1:1)-1)*3+INT((ROW(1:1)-1)/4)*2)
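The row arithmetic inside INDEX can be checked with a small Python sketch. The function name source_row and the first_row parameter are made up for this illustration; n plays the role of ROW(1:1) as the formula is copied down.

```python
def source_row(n, first_row=2):
    """Row on Sheet 1 that holds the n-th value (n starts at 1)."""
    # Each value is 3 rows after the previous one, and every group of
    # four values is pushed down by 2 extra rows.
    return first_row + (n - 1) * 3 + ((n - 1) // 4) * 2

rows = [source_row(n) for n in range(1, 10)]
print(rows)  # [2, 5, 8, 11, 16, 19, 22, 25, 30]
```

The first eight results are the rows mentioned in the question (2 to 11 and 16 to 25), and the ninth starts the next block after four blank rows.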
Put this in D2:
=IFERROR(INDEX(Sheet1!B:B,AGGREGATE(15,6,ROW(Sheet1!$B$2:INDEX(Sheet1!B:B,MATCH("ZZZ",Sheet1!A:A)))/(Sheet1!$B$2:INDEX(Sheet1!B:B,MATCH("ZZZ",Sheet1!A:A))<>""),ROW(1:1))),"")
And copy down till you get blanks.
This will return the numbers in order that they appear on sheet 1.
The Sheet1!$B$2:INDEX(Sheet1!B:B,MATCH("ZZZ",Sheet1!A:A)) sets the bounds of the data set. Because this is an array-type formula, it should reference the smallest possible range. This part finds the last cell in column A and uses it as the extent of the data set so that we do not do unnecessary iterations.
The MATCH part will return the last row that has text in it. If column A has numbers, then we need to change the "ZZZ" to 1E+99 to get the last row in column A with a number.
The AGGREGATE works like SMALL here: it builds an array of row numbers and errors. It returns the row number where (Sheet1!$B$2:INDEX(Sheet1!B:B,MATCH("ZZZ",Sheet1!A:A))<>"") is TRUE, and an error where it is FALSE.
The second argument, 6, tells AGGREGATE to ignore the errors, so it only looks at the returned row numbers.
The ROW(1:1) is a counter. As the formula is dragged down it will iterate to 2 then 3 and so on. This tells the Aggregate that you want the 1st then the 2nd then the 3rd and so on.
The chosen row number is then passed to the INDEX and the correct value is returned.
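As a rough Python sketch of what that formula does (not Excel code; the list column_b and the function name are invented for the example), the idea is: collect the row numbers of the non-blank cells, take the k-th smallest, and look up the value in that row.

```python
def nth_nonblank(cells, k, first_row=2):
    """Return the k-th non-blank value (k starts at 1), or "" if there is none."""
    # Row numbers of the non-blank cells; blank cells play the role of
    # the errors that AGGREGATE is told to ignore.
    rows = [first_row + i for i, v in enumerate(cells) if v != ""]
    if k > len(rows):
        return ""  # mirrors the IFERROR(..., "") wrapper
    row = sorted(rows)[k - 1]       # k-th smallest row number
    return cells[row - first_row]   # the INDEX step

# Hypothetical column B starting at row 2, with blanks between values
column_b = [10, "", "", 20, "", "", 30]
result = [nth_nonblank(column_b, k) for k in range(1, 5)]
print(result)  # [10, 20, 30, '']
```

The fourth call returns an empty string, which corresponds to the blanks you see once the formula has been copied down past the last value.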
If your numbers are in order (smallest to largest, as in your example) or you want the output in order (smallest to largest), then you can use this simple formula in D2:
=IFERROR(SMALL(Sheet1!B:B,ROW(1:1)),"")
Then copy down till you get blanks.
Here is another formula you might use.
=INDIRECT(ADDRESS((INT((ROW()-ROW($A$2))/4)*14+ROW(A$2))+(MOD(ROW()-ROW($A$2),4)*3),COLUMN($A$2),1,1,"Sheet1"))
You can paste it to the first cell where you want the result and copy down.
Note that $A$2 is the cell from where all the counting starts. If your data start from A3 you can change the references accordingly. Note further that ROW($A$2) is long for 2. I chose this syntax to enable you to identify the meaning.
COLUMN($A$2), on the other hand, just identifies Column A as the source of the data to be lifted. Row 2 in this formula is insignificant. It's the A that counts. However, COLUMN($A$2) is long for just 1, meaning column No. 1, meaning A. Once you get your bearing in the formula you can replace COLUMN($A$2) with 1.

Dynamically build array of values by indirect column reference

I'm building a yearly scorecard (sample shown above). The requirements for the scorecard are listed below.
Year to Date values must cumulatively add each of the previous period values (circled in orange).
P1 = P1
P2 = P1 + P2
P3 = P1 + P2 + P3 (etc)
Year to Date formulas must all be the exact same, dynamically referencing the required columns and required rows so that they can be easily copied from period to period (on going).
With this formula I was trying to look at row 2, which holds the column indicators, and test it with ISTEXT() to add up the values in row ROW()-1. Using concatenate to build a string that references a row range might not be the best way to do it.
Example: If I have values in row 55
=SUM(INDIRECT(CONCATENATE(ROW()-1, ":", ROW()-1)))
=SUM(INDEX(INDIRECT(CONCATENATE(ROW()-1, ":", ROW()-1)),MATCH(ISTEXT(2:2),2:2,0)))
I was trying something like a horizontal sumifs() formula with little luck, attempting to use the modulus value of the column() function as a logical test.
This formula doesn't work:
=SUMIFS(INDIRECT(ROW()-1&":"&ROW()-1), MOD(COLUMN()-2, 6), 0)
Or Using some other method of testing which columns to add.
=SUMIFS(INDIRECT(CONCATENATE(ROW()-1, ":", ROW()-1)), IF(ISTEXT(2:2), 1, 0), TRUE)
If I change my lettering in Row 2 (N, H, T) to just "X" then test for X that works, but this formula doesn't factor in the requirement for only adding values from current and prior periods.
=SUMIFS(INDIRECT(CONCATENATE(ROW()-1, ":", ROW()-1)),2:2, "X")
I don't know of a way to add up a dynamic number of indirect cell references based on the column you're in. So let's say it's row 55 in period 3: I would need a formula that looks in row 2, sees each of the column values (H, N, T) and adds up (H55, N55, T55). That same formula would need to construct a different list if it's in period 2: (H, N), (H55, N55).
Maybe I need to rethink my approach entirely? Write VBA instead?
Edit
To better expand on what the data model is, to address some comments, I've thrown some dummy values and dirty formulas in.
Have a look at service level vs. service level year to date (YTD). Service level is just a flat data entry of weekly performance, then the Summary column is a simple average of the weekly performance in order to report period performance. The YTD number is an average of the period numbers, so these values progressively roll up.
The formulas I'm trying to write are for the summary columns, both period value and YTD values.
It's not entirely clear what your data layout is.
So, assuming:
Labels that identify columns to sum are in row 2
Values to sum are in row 55
Formula is to sum values in row 55, which have a non-blank entry in row 2, and sum values in columns up to and including the column the formula is in
Formula
=SUMPRODUCT($55:$55,--(COLUMN($55:$55)<=COLUMN()),--($2:$2<>""))
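Here is a small Python sketch of what the SUMPRODUCT is doing, under the assumptions listed above. The labels and values are invented sample data, and a six-column row stands in for the real sheet.

```python
# Hypothetical row 2 (labels) and row 55 (values), columns 1..6
labels = ["", "H", "", "N", "", "T"]
values = [9, 10, 9, 20, 9, 30]

def year_to_date(formula_column):
    """Sum the values whose label is non-blank, up to and including formula_column."""
    return sum(
        value
        for column, (label, value) in enumerate(zip(labels, values), start=1)
        if label != "" and column <= formula_column
    )

print(year_to_date(2))  # 10
print(year_to_date(4))  # 30
print(year_to_date(6))  # 60
```

Each condition in the generator corresponds to one of the two --( ... ) factors in the formula, so the same formula gives a different total depending on the column it sits in.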
For column T use:
=SUM(IF(MOD(COLUMN($H:T),6)=2,$H$1:T$1,0))
This is an array formula and must be confirmed with Ctrl+Shift+Enter.
Change the $H$1:T$1 to the row you need to sum (it will only sum every sixth column starting with H).
With UPEH in row 9 and this formula in row 10, it becomes =SUM(IF(MOD(COLUMN($H:T),6)=2,$H$9:T$9,0))
Once it is set up correctly, you can copy and paste it wherever you need it (as long as it stays with summing every 6th column starting at H).
To make it more dynamic, you may be better off using:
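To see why the test is MOD(COLUMN(...),6)=2, a quick Python check of the column numbers from H (8) to T (20) may help. This is only a sketch of the arithmetic, not something to paste into Excel.

```python
# Column numbers H..T are 8..20; keep those where the number mod 6 equals 2
picked = [chr(ord("A") + c - 1) for c in range(8, 21) if c % 6 == 2]
print(picked)  # ['H', 'N', 'T']
```

Column H is number 8, and 8 mod 6 is 2, so every sixth column after H (N, T, and so on) passes the same test.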
=SUM(IF($A$4:T$4="Summary",$A$9:T$9,0))
This is an array formula and must be confirmed with Ctrl+Shift+Enter.
It checks row 4 for "Summary" to decide which values to sum :)
EDIT
However, if you want to have exactly the same formula in each part you would need to use something like that:
=SUM(IF(($4:$4="Summary")*(COLUMN($4:$4)<=COLUMN()),OFFSET($1:$1,ROW()-2,),0))
This is an array formula and must be confirmed with Ctrl+Shift+Enter.
It sums all the cells one row above itself, from the beginning up to and including its own column, for all columns containing "Summary" in row 4.
However, this may get pretty slow pretty fast (it calculates a LOT) ^^
BIG HINT: Just looking at what you have/need
Let's assume the cells to add are in row 1 and the output is in row 2...
We also skip the columns that are not part of the calculation (to make it easy)...
A2 would be just A1
B2 would be A1 + B1
C2 would be A1 + B1 + C1... but wait!
A1 + B1 = B2 so better -> C2 = B2 + C1
leads to:
R2Cx = R2C(x-1) + R1Cx
If you just use that behavior in column N (the value above it plus the calculated value to the left, in column H) and write it that way, you can copy it and paste it into column T and you will get =T(above) + N(calculated). Check it :)
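The hint can be checked with a few lines of Python. The period values below are made up; the point is only that "previous running total plus current value" gives the same numbers as adding up all periods from the start.

```python
# Hypothetical period values for P1, P2, P3
period_values = [4, 7, 5]

# Direct approach: P1, P1+P2, P1+P2+P3
direct = [sum(period_values[: i + 1]) for i in range(len(period_values))]

# Running approach: value to the left (previous total) + value above (current period)
running = []
for value in period_values:
    previous_total = running[-1] if running else 0
    running.append(previous_total + value)

print(direct)   # [4, 11, 16]
print(running)  # [4, 11, 16]
```

Because each cell only looks at its left neighbour and the cell above, the formula is identical in every period, which is what the question asked for.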

Create a dynamic 'if' statement in Excel without VBA

* Updated *
I have a rather large excel data set that I'm trying to summarise using up to 3 dimensions: region, sector, industry.
Any combination of these dimensions can be set or left blank and I need to create a formula that accommodates this WITHOUT using VBA.
Within the data I've set up named ranges to refer to these dimensions.
I'm using an array formula but I'd like to dynamically create a string which is then used as the boolean argument in the array formula.
For instance if:
A1 = "Hong Kong" (region)
B1 = <blank> (sector)
C1 = <blank> (industry)
I create a dynamic string in D1 such that
D1 = (region="Hong Kong")
I then want to use the string in D1 to create an array formula
E1 = {counta(if(D1,employees))}
However, if the user includes a sector such that:
A2 = "Hong Kong" (region)
B2 = "finance" (sector)
C2 = <blank> (industry)
Then I want the string in D2 to update to:
D2 = (region="Hong Kong")*(sector="finance")
Which then automatically updates the value in E2 which still has the same formula.
E2 = {counta(if(D2,employees))}
Is this possible? Alternatively is there any other way of achieving the same outcome, keeping in mind that I need to be able to copy D1 and E1 down into different rows so that different combinations of the dimensions can be viewed simultaneously.
Thanks.
* Updated *
To be clear, the reason I need the values in column D to be dynamic is so that I can create different scenarios in Row 1, Row 2, Row 3 etc., and I need the values in column E of each row to match the criteria set in columns A:C of that row.
There had to be a fairly simple way!
Columns B:D contain the criteria, A is a criterion number and E is the result of applying the DSUM function to the criterion in that row. I've used DSUM as it seems more natural (to me at least) to sum employee numbers. However, DCOUNT can equally well be used. For brevity I've not shown the data I'm using but it is a very trivial data set with just a few rows of test data.
The first set of criteria in row 2 is: Sector takes value of "Man" (manufacturing) whilst Region and Industry are unspecified. The 3rd set of criteria (in row 4) is: the Region is "Fr" (for France) AND the Industry is "Cars". The results in the DSUM column are obtained by applying the set of criteria in the corresponding row. All, some or even none of the cells in a row may contain entries.
The approach used is based on columns G:J, where, with the exception of cells G1 and G2 (which contain the numbers 0 and 1, respectively), everything in these columns has been generated by a formula.
There are twice as many rows in columns G:J as there are sets of criteria listed in B:D and the rows should be taken in pairs. The first pair (rows 1 and 2) provide a criterion table for use in DSUM corresponding to the first set of criteria (the table is cells H1:J2), the second pair in rows 3 and 4 provides a criterion table for the second set of criteria (cells H3:J4), etc. (Ignore the 11th row - I copied too many rows downwards in the screenshot!)
Column G has a fairly obvious pattern and can be generated by applying a simple =IF() function in cell G3 which references the starting pair in G1 and G2 with the formula in G3 then copied downwards.
The cells in columns H:J reference the appropriate cells of the set of all criteria (B1:D6 in the screenshot) using the INDEX function (and making use of the value in column G along the way). It is not too difficult to create a single formula that can be copied from H1 to the range H1:J11 by judicious use of mixed relative and absolute addressing (and an IF or two). Note that references to an empty cell in B2:D6 will generate a value of 0 in the corresponding cell in H:J so the construct IF(x=0,"",x) must be used - this makes the formula used in the cells in columns H:J a bit clunky but not excessively so.
Having generated the 5 criteria tables corresponding to the 5 sets of criteria in B:D, use is made of the OFFSET function to deliver the correct criterion table as the third argument of the DSUM functions in column E.
I chose to base my OFFSETs on cell $H$1, so the top-left cell of the criterion table for the first set of criteria is offset from my base cell by 0 rows and 0 columns. The second criterion table is offset by 2 rows and 0 columns, the third by 4 rows and 0 columns. It should be clear how the number of offset rows and columns to use can be calculated from the corresponding criterion number in column A. It should also be obvious that the final two arguments of the OFFSET function will always be 2 and 3. So my DSUM() functions in column E look something like
=DSUM(myData,"Employees",OFFSET($H$1,row_offset,0,2,3))
where myData is the named range containing the test dataset and row_offset is a very simple formula involving the corresponding value in column A.
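One way to read the offsets described above (0 rows for the first set of criteria, 2 for the second, 4 for the third) is row_offset = 2 * (criterion number - 1). That expression is my assumption from the pattern, not a formula quoted from the worksheet. A short Python sketch of which rows of H:J each criterion table would occupy:

```python
def criterion_table_rows(criterion_number):
    """First and last worksheet row of the 2-row criterion table in H:J."""
    row_offset = 2 * (criterion_number - 1)  # assumed formula, see lead-in
    first_row = 1 + row_offset               # base cell is $H$1
    return first_row, first_row + 1          # OFFSET height is 2

tables = [criterion_table_rows(n) for n in range(1, 6)]
print(tables)  # [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
```

The five tables fill rows 1 to 10, which matches the H1:J10 block mentioned in the answer.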
It would have been nice to have been able to deliver the third argument of the function without having to adopt the approach of effectively reproducing the sets of criteria in B1:D6 in cells H1:J10. Whilst there are ways to generate the required criterion table arrays formulaically without putting them onto the worksheet, I found that DSUM generated an error when applying such an array as its third argument.

Need formula operating against a dynamic range copied across a series of cells

I'm creating a grid of correlation values, like a distance grid. I have a series of cells that each contain a formula whose ranges are easy to describe if you know the offset from the first cell, and I'm having trouble figuring out how to specify it.
In the upper left hand cell (R10), the formula is CORREL(C2:C21,C2:C21) -- it's 1, of course.
In the next column over (S10), the formula is CORREL(D2:D21,C2:C21).
In the next row down (R11), the formula is CORREL(C2:C21,D2:D21).
Of course, S11 would contain CORREL(D2:D21,D2:D21), which is also 1. And so on, for a roughly 15x15 grid.
Here's a graphical representation of the ranges involved:
C2:C21,C2:C21 C2:C21,D2:D21 C2:C21,E2:E21
D2:D21,C2:C21 D2:D21,D2:D21 D2:D21,E2:E21
E2:E21,C2:C21 E2:E21,D2:D21 E2:E21,E2:E21
Whenever I add a new data row, I have to manually update several formulas. So, I'd like the last non-blank row number (21, in this case) to be dynamically determined, such as with COUNTA(C:C). Ideally, I'd like the formula to calculate the column offsets, too, so that I can drag one formula across my entire range.
What's the best way to accomplish this? I think OFFSET might be a component in the solution, but I haven't had success getting it all to work together.
Using this simple setup per element of the corr matrix also helps:
=CORREL(INDIRECT("'Risk factors'!"&"T"&G6&":T"&H6);INDIRECT("'Risk factors'!"&"U"&G6&":U"&H6))
With this function I refer to data in another sheet, Risk factors, to correlate columns T and U with each other. I want the ranges of the data to be dynamic, so I use G6 and H6 in my current sheet to set the extent of the columns (the first and last row numbers), which I of course specify in these cells.
Hope this helps!
I found this formula, while wordy, achieved the desired results. In this example, the data lives in C2:O19. The table I wanted to construct computed the correlation values of all permutations of pairs of columns. Since there are 11 columns, the correlation pairs table is 11x11 and starts at R10. Each cell has the following formula:
=CORREL(INDIRECT(ADDRESS(2,2+(ROWS($R$10:R10)),4)&":"&ADDRESS(COUNTA($C:$C),
2+(ROWS($R$10:R10)),4)),INDIRECT(ADDRESS(2,2+(COLUMNS($R$10:R10)),4)&":"&
ADDRESS(COUNTA($C:$C),2+(COLUMNS($R$10:R10)),4)))
As I found out, INDIRECT() takes a reference given as text and returns the range it refers to.
Let's take a cell, say U12, and look at the range formula in detail. The first INDIRECT is the column given by applying the row offset from R10.
Since Row 12 is 2 rows down from Row 10, ADDRESS(2,2+(ROWS($R$10:U12)),4)&":"&ADDRESS(COUNTA($C:$C),2+(ROWS($R$10:U12)),4) should yield the column that's 2 columns right of Column C, which is E. The formula evaluates to E2:E19.
The second INDIRECT is the column given by applying the column offset from R10. Similarly, since Column U is 3 columns right of Column R, ADDRESS(2,2+(COLUMNS($R$10:U12)),4)&":"&ADDRESS(COUNTA($C:$C),2+(COLUMNS($R$10:U12)),4) should yield the column that's 3 columns right of Column C, which is F. The second formula evaluates to F2:F19.
Substituting these range reference values in, the cell formula reduces to =CORREL(INDIRECT("E2:E19"),INDIRECT("F2:F19")) and further to =CORREL(E2:E19,F2:F19), which is what I'd been using up till now.
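For anyone who wants to check the offsets without stepping through ADDRESS by hand, here is a Python sketch of the same arithmetic. The function names are invented, the grid is assumed to start at R10 (column 18), and last_row=19 stands in for COUNTA($C:$C).

```python
def column_letter(n):
    """Convert a 1-based column number to its letter label (1 -> A, 27 -> AA)."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

def correl_ranges(cell_row, cell_col, top_row=10, left_col=18, last_row=19):
    """Return the two range strings the formula builds for a given grid cell."""
    rows = cell_row - top_row + 1    # ROWS($R$10:cell)
    cols = cell_col - left_col + 1   # COLUMNS($R$10:cell)
    first = column_letter(2 + rows)
    second = column_letter(2 + cols)
    return f"{first}2:{first}{last_row}", f"{second}2:{second}{last_row}"

print(correl_ranges(12, 21))  # U12 -> ('E2:E19', 'F2:F19')
print(correl_ranges(10, 18))  # R10 -> ('C2:C19', 'C2:C19')
```

The U12 result matches the E2:E19 and F2:F19 ranges worked out above, and R10 pairs column C with itself.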
Just like a distance table, this table is symmetrical along the diagonal, because =CORREL(E2:E19,F2:F19) equals =CORREL(F2:F19,E2:E19). Each value on the diagonal is 1, because CORREL of the same range is 100% correlation by definition.

Compare two data sheets

The issue I'm faced with is I have two sheets of data in Excel. They are a stocksheet list, listing items that have a variance from a stocktake. The items are randomly placed between both documents, so it is almost impossible to do a side-by-side view even if I were to order the columns (which I already have). For example it would be like this:
Sheet 1:
A1 (Apple) (1)
A2 (Carrot) (-3)
A3 (Banana) (4)
A4 (Chocolate) (-7)
Whereas Sheet 2 may be:
A1 (Orange) (-2)
A2 (Apple) (3)
A3 (Muffin) (-8)
A4 (Carrot) (3)
So as you can see, the same data may appear, and if it does I want to compare those two sets, to know the variance, i.e. Sheet 1 said -3 whereas sheet 2 said +1... I preferably would like to do this in a batch if possible, as there are over 800 cells to go through.
Just so that you can see what I'm dealing with, here's links to pastebins of both sheets;
Sheet 1: http://pastebin.com/6i7QKJ6N
Sheet 2: http://pastebin.com/zjtC2U7q
Is there anything anyone can think of that would be able to assist me, other than me going through this one by one which I am considering doing?
Excuse me for avoiding the real situation and sticking with your example. Assuming the values are in Column B in the corresponding rows, then:
in Sheet1: =VLOOKUP(A1,Sheet2!A:B,2,FALSE)
in Sheet2: =VLOOKUP(A1,Sheet1!A:B,2,FALSE)
say in Column C, should 'align' the entries (where both exist, otherwise #N/A). =B1=C1 in D1, copied down, should then help to identify the mismatches, and say =B1-C1 in E1, copied down, should quantify the discrepancies between the sheets, by 'vegetable'.
There should be no need for a batch mode for this.
I'm assuming that the unique identifier for the stock items is the column labelled CYSKU, right?
If that's so, then there are only 192 common items between the two sheets. I ran a vlookup in both sheets a bit similar to the one pnuts used and used a filter.
There are more variances between CYCOST than with CYRETL as far as I can see (I haven't compared the other columns).
To perform the comparison, you can do the following:
Insert a column between columns C and F (just after CYSKU) and put a vlookup formula in row 2 of this column and fill it down:
=VLOOKUP(C2, Sheet2!C:C, 1, 0)
Insert a filter and filter out #N/A from this column to get only those that are common between the two sheets.
In column M (after CYDVAR), insert another vlookup and fill it down:
=VLOOKUP(C2, Sheet2!C:F, 4, 0)
This will give you the corresponding CYRETL from Sheet2. You can then compare the two CYRETL.
How VLOOKUP works:
The first parameter is what VLOOKUP will be looking for.
The second parameter is the table range in which to look up the first parameter.
The third parameter is the nth column from which a match will be returned, limited to the table (if the table is in column A:A, only 1 column is available, if the table is A:B, 2 columns are available, etc).
The last parameter is for either exact or approximate match. Exact is 0 (or FALSE) and approximate is 1 (or TRUE).
You can just change the table range and the column number to change the value you're looking for from Sheet2.
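If it helps to see the lookup idea outside Excel, here is a Python sketch using the small fruit example from the question rather than the real stock sheets. Each dictionary stands in for one sheet, and the dictionary lookup plays the part of an exact-match VLOOKUP.

```python
# Item -> variance, taken from the question's example
sheet1 = {"Apple": 1, "Carrot": -3, "Banana": 4, "Chocolate": -7}
sheet2 = {"Orange": -2, "Apple": 3, "Muffin": -8, "Carrot": 3}

# Keep only the items found on both sheets (the rows that are not #N/A)
# and compute Sheet 1 value minus Sheet 2 value for each of them.
differences = {
    item: value - sheet2[item]
    for item, value in sheet1.items()
    if item in sheet2
}

print(differences)  # {'Apple': -2, 'Carrot': -6}
```

Banana and Chocolate are skipped because they have no match on the second sheet, which is the same as filtering out #N/A in the worksheet version.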

Resources