Optimizing Excel formulas - SUMPRODUCT vs SUMIFS/COUNTIFS - excel

According to a couple of web sites, SUMIFS and COUNTIFS are faster than SUMPRODUCT (for example: http://exceluser.com/blog/483/excels-sumifs-or-sumproduct-which-is-faster.html). I have a worksheet with a variable number of rows (around 200,000) and I'm calculating performance reports from the numbers. I have over 6000 almost identical SUMPRODUCT formulas that differ only slightly from one another (only the conditions change).
Here is an example of what I have:
=IF(AFO4>0,
(SUMPRODUCT((Sheet1!$N:$N=$A4)
*(LEFT(Sheet1!$H:$H,2)="1A")
*(Sheet1!$M:$M<>"service catalog")
*(Sheet1!$J:$J="incident")
*(Sheet1!$I:$I<>"self-serve")
*(Sheet1!$AK:$AK=AFM$1)
*(Sheet1!$E:$E>=$E$1)
*(Sheet1!$E:$E<$E$2))
+SUMPRODUCT((Sheet1!$AJ:$AJ=$C4)
*(LEFT(Sheet1!$H:$H,2)="1A")
*(Sheet1!$M:$M<>"service catalog")
*(Sheet1!$J:$J="incident")
*(Sheet1!$I:$I="self-serve")
*(Sheet1!$AK:$AK=AFM$1)
*(Sheet1!$E:$E>=$E$1)
*(Sheet1!$E:$E<$E$2)))/AFO4,0)
Calculating that thing takes a little bit more than 1 second. Since I have more than 6000 of those formulas, it takes a little bit over an hour to calculate everything.
So, I'm now looking at how I could optimize that formula. Could I convert it to SUMIFS? Would it be faster? All I'm adding up here is 0s and 1s; I'm just counting the number of rows in my data source (Sheet1) where the set of conditions is met. Maybe COUNTIFS would work better?
I would appreciate any help to gain some execution time since we need to execute the formulas every month.
I can use VBA if that helps, but I always heard that Excel formulas were usually faster.

Instead of formulas, why not use a PivotTable to crunch the numbers? You potentially face a longer one-time hit to load the data into the PivotCache, but after that, you should find a PivotTable recalculates much faster in response to filter changes than these computationally expensive formulas. Is there any reason you're not using one?
Here's some content from a book I'm writing, where I compare SUMPRODUCT, SUMIFS, DSUM, PivotTables, the Advanced Filter, and something called Range Slicing (which uses clever combinations of INDEX/MATCH on sorted data) to conditionally sum the records in a table that contains over 1 million sales records, based on choices you make from 10 different dropdowns:
Those dropdowns allow you to filter the database by a combination of the Store, Segment, Species, Gender, Payment, Cust. History, Order Status, Delivery Instructions, Membership Type, and Order Channel columns. So there’s some pretty mammoth filtering and aggregation going on in order to reduce those 1 million records down to just one sum. The file outlines six different ways to achieve this outcome, the first three of which are shown in the screenshot below:
As you’d expect, when all those dropdowns are set to the same settings, you get exactly the same answer out of all six approaches. But what you won’t expect is just how slow SUMPRODUCT is to calculate a new answer if you change one of those dropdowns, compared to the other approaches.
In fact, it turns out that the SUMIFS approach is 15 times faster than the SUMPRODUCT one at coming up with the answer on this mammoth dataset. But that’s nothing: The range slicing approach is 56 times faster!
The Range Slicing approach works by sorting your source data, and then using a series of formulas in helper columns to identify exactly where any records of interest sit within that sorted data. This means that you can then directly sum just the few records that match rather than having to do a complex criteria match against hundreds of thousands of rows (or against a million rows, as in the example here).
Here’s how that looks in terms of my sample file. The number in the Rows helper column on the right-hand side shows that through some clever elimination, the SUM function at the bottom has to process only 18 rows of data (rows 292996 through 293014) rather than all 1 million rows. In other words, this is mighty efficient.
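The book's actual helper formulas are not reproduced here, but the core idea can be sketched for a single criterion. This is a minimal illustration rather than the author's implementation, and every address in it is hypothetical: keys sorted ascending in A2:A1000, amounts in C2:C1000, and the key to look up in F1.

```excel
=IFERROR(SUM(INDEX($C$2:$C$1000,MATCH($F$1,$A$2:$A$1000,0)):INDEX($C$2:$C$1000,MATCH($F$1,$A$2:$A$1000,1))),0)
```

The exact-match MATCH returns the first row holding the key, and the approximate MATCH (a binary search on the sorted keys) returns the last one, so SUM only touches the block in between. IFERROR returns 0 when the key is absent. Note that the exact-match lookup is itself a linear scan, so a fully optimized version would probably locate the first row with a binary search as well.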
And here’s the second group of alternatives:
Yup, you can quite easily use a PivotTable here. And the PivotTable approach seems to be around 6 times faster than SUMPRODUCT—although you get a small amount of extra delay when calling up the filters, and the first time you perform a filter operation it takes quite a bit longer again, as Excel must load the PivotCache into memory. But let’s face it: Setting up the PivotTable in the first place is the easiest of any of these approaches, so it has my vote.
The DSUM approach is 12 times faster than SUMPRODUCT. That’s not as good as SUMIFS, but it’s still a significant improvement. The Advanced Filter approach is only 4 times faster than SUMPRODUCT—which isn’t really surprising because what it does is grab an extract of all records from the source data that match the criteria in that list, dump it into the spreadsheet, and then sum the result.

The first SUMPRODUCT could become:
=COUNTIFS(Sheet1!$N:$N,$A4,Sheet1!$H:$H,"1A*",Sheet1!$M:$M,"<>service catalog",Sheet1!$J:$J,"incident",Sheet1!$I:$I,"<>self-serve",Sheet1!$AK:$AK,AFM$1,Sheet1!$E:$E,">="&$E$1,Sheet1!$E:$E,"<"&$E$2)
The LEFT part can be handled by a wildcard, as shown.
Change the second part along the same lines.
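Applying the same conversion to the second SUMPRODUCT, a sketch of the whole formula might look like the following. It assumes column H holds text (the "1A*" wildcard only matches text cells) and that the criteria cells contain no wildcard characters. It has not been timed against the original workbook.

```excel
=IF(AFO4>0,
(COUNTIFS(Sheet1!$N:$N,$A4,Sheet1!$H:$H,"1A*",Sheet1!$M:$M,"<>service catalog",Sheet1!$J:$J,"incident",Sheet1!$I:$I,"<>self-serve",Sheet1!$AK:$AK,AFM$1,Sheet1!$E:$E,">="&$E$1,Sheet1!$E:$E,"<"&$E$2)
+COUNTIFS(Sheet1!$AJ:$AJ,$C4,Sheet1!$H:$H,"1A*",Sheet1!$M:$M,"<>service catalog",Sheet1!$J:$J,"incident",Sheet1!$I:$I,"self-serve",Sheet1!$AK:$AK,AFM$1,Sheet1!$E:$E,">="&$E$1,Sheet1!$E:$E,"<"&$E$2))/AFO4,0)
```

The two COUNTIFS calls differ only in the first criterion (column N against $A4 versus column AJ against $C4) and in the column I test ("<>self-serve" versus "self-serve"), mirroring the two SUMPRODUCT terms in the question.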

Related

Looking for a better formula in Excel to Calculate Duplicates

I have 621,224 rows by 1 column of data in my Excel sheet and have to count the total number of duplicates in it, but =IF(COUNTIF($J$1000:J14353,J5353)>1,1,0) takes far too much time. I suspect this formula has O(n^2) complexity. I am looking for a formula that takes less time, ideally O(n log n), if Excel has one.
As of now I am doing this task manually in ranges of 10k rows, which runs in acceptable time. I have also sorted the list.
I searched for VLOOKUP and found it would take around the same time as COUNTIF.
If you've sorted the data then you can use binary searching, which will be many times faster than your current linear set-up with COUNTIF. For example:
=SUMPRODUCT(N(MATCH(A1:A750000,A:A)<>ROW(A1:A750000)))
On my machine, this produced a result in less than 1 second.
If you aren't able to sort the data first and you have Office 365, you can perform the sorting in-formula:
=LET(ζ,A1:A750000,ξ,SORT(ζ),SUMPRODUCT(N(MATCH(ξ,ξ)<>SEQUENCE(ROWS(ζ)))))
which should still be very fast.
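Because equal values sit next to each other once the data is sorted, a simpler sketch compares each cell with the one above it. It assumes the same sorted range A1:A750000 with no blank cells, and counts every repeat after the first occurrence of a value, which should agree with the MATCH-based count. I have not timed it.

```excel
=SUMPRODUCT(--(A2:A750000=A1:A749999))
```

This makes a single pass over the column with no lookups at all, but it only works on sorted data.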
Dropping IF() may be faster. Try:
=SUM(--(COUNTIFS(A3:A13,A3:A13)>1))

Using Sumproduct to calculate two tables using horizontal (table headers) and vertical references

Hopefully the title makes some sense because I'm trying to wrap my head around the logic and I'm not quite sure how to phrase the question. I'll try to give a brief explanation of the end goal without overcomplicating it with unnecessary details.
I have a table of survey score averages for every month per person and a corresponding table with the number of surveys each person received for each month. The logic is essentially to multiply the score for each month by the number of surveys, add the results together, and divide by the total number of surveys within that time period to get their true average. Where things get a little complicated is that I have to include the ability to set a custom date range and return the value. So sometimes I might be looking at the average for Jan-Apr, other times I might just be looking at Feb-Mar, etc.
I think SUMPRODUCT is going to get what I need done but I'm running into issues trying to write it out. I've written it several different ways and none of them worked, so here's the one that best conveys what I'm trying to do:
=SUMPRODUCT(--(F7:I7,L7:O7>=C2),--(F7:I7,L7:O7<=C3),--(E8:E12,K8:K12=B9),tbl_average[[Jan-20]:[Apr-20]],tbl_surveys[[Jan-20]:[Apr-20]])
I super appreciate any assistance I can get on this. I'm hoping the end result is not nearly as difficult as I'm making it out to be.
Some additional information:
I'm going to be using this same process to calculate multiple metrics across multiple worksheets. In the test example each of the tables will most likely be on different sheets. The dashboard with the calculated results will contain everyone's names and will be filtered and rearranged frequently, so I need to make sure we're always matching directly to their names and not just the relative rows. Basically, in my example I show that Agent 1 is always lined up on row 8 but that's not always going to be the case. Agent 1 could be in Row 8 on Sheet 1, Row 10 on Sheet 2, and Row 12 on Sheet 3 and I need all the correct values to multiply and sum against one another.

Summing up cohort behavior cumulatively by date ranges without offsets in excel

I think this problem, when solved by creating additional charts with offsets, is easy. I want to cut out the middle man and not use offsets (unless they are useful to the answer). I have data for daily cohorts and I know specific information about their behavior 1 day later, 2 days later, 3 days later, etc.
Now it is rather easy to make a waterfall chart of day by day activity like so...
What I want to do is skip this step (directly above, the waterfall chart) in hopes of shrinking my current workbook by a substantial amount. You can imagine having simply 1 year of data across multiple channels measuring even 1 aspect of behaviors can account for a lot of data and pivot charts. Also, btw, I have the top chart as a pivot thus allowing this to be hands off when calculating what I am looking for.
What I seek - I look to further construct groups of days as other cohorts to examine (for example, say, 1/1 - 1/5) and see what their activity has been in a cumulative fashion since then. To be more specific, I want a table that will show cohort 1/1-1/5's activity in the date range 1/1-1/5 (11) and then their activity from 1/1-1/9 (24, an additional 13 "behavior points" summed).
So far, as I said, my current solution involves the "blue arrow" schematic where an additional table is constructed and I can sum on, essentially, rectangles built by using OFFSET on cell ranges with the MATCH function. I am stumped as to how to go about this without the additional charts.
Thanks!
VBA would be better for this, but use this formula in C30:
=IFERROR(SUM(SUMIF(OFFSET(OFFSET($B$1,$A30-MIN($B$2:$B$10),MIN($B$2:$B$10)-$A30+DAY(C$28)),SEQUENCE($B30-$A30+1),IF(COLUMN(OFFSET($B$1,$A30-MIN($B$2:$B$10),MIN($B$2:$B$10)-$A30+DAY(C$28)))-SEQUENCE($B30-$A30+1,,0)>COLUMN($B$1),0-SEQUENCE($B30-$A30+1,,0),COLUMN($B$1)-COLUMN(OFFSET($B$1,$A30-MIN($B$2:$B$10),MIN($B$2:$B$10)-$A30+DAY(C$28)))+1),1,IF(COLUMN(OFFSET($B$1,$A30-MIN($B$2:$B$10),MIN($B$2:$B$10)-$A30+DAY(C$28)))-SEQUENCE($B30-$A30+1,,0)<=COLUMN($B$1),(C$29-C$28+1)-SEQUENCE($B30-$A30+1,,-(COLUMN(OFFSET($B$1,$A30-MIN($B$2:$B$10),MIN($B$2:$B$10)-$A30+DAY(C$28)))-COLUMN($B$1)-1)),C$29-C$28+1)),"<>")),0)
and this in D30:
=C30+SUM(SUMIF(OFFSET(OFFSET($B$1,$A30-MIN($B$2:$B$10),MIN($B$2:$B$10)-$A30+DAY(D$28)),SEQUENCE($B30-$A30+1),IF(COLUMN(OFFSET($B$1,$A30-MIN($B$2:$B$10),MIN($B$2:$B$10)-$A30+DAY(D$28)))-SEQUENCE($B30-$A30+1,,0)>COLUMN($B$1),0-SEQUENCE($B30-$A30+1,,0),COLUMN($B$1)-COLUMN(OFFSET($B$1,$A30-MIN($B$2:$B$10),MIN($B$2:$B$10)-$A30+DAY(D$28)))+1),1,IF(COLUMN(OFFSET($B$1,$A30-MIN($B$2:$B$10),MIN($B$2:$B$10)-$A30+DAY(D$28)))-SEQUENCE($B30-$A30+1,,0)<=COLUMN($B$1),(D$29-D$28+1)-SEQUENCE($B30-$A30+1,,-(COLUMN(OFFSET($B$1,$A30-MIN($B$2:$B$10),MIN($B$2:$B$10)-$A30+DAY(D$28)))-COLUMN($B$1)-1)),D$29-D$28+1)),"<>"))
And copy both down.
If one does not have the dynamic array function SEQUENCE(), then replace all the SEQUENCE($B30-$A30+1) and SEQUENCE($B30-$A30+1,,0) with ROW($ZZ$1:INDEX($ZZ:$ZZ,$B30-$A30+1)) and (ROW($ZZ$1:INDEX($ZZ:$ZZ,$B30-$A30+1))-1) respectively, and use Ctrl-Shift-Enter instead of Enter when exiting edit mode.
I was able to collaborate on a solution. I am told that it will be highly inefficient at scale but it gets the job done. It is less automation-friendly but can be formulated to capture data not currently present in, say, a pivot table that you update later, by extending the area that the formula works on.
Formula in I31:
=SUM(IF(($C$1:$O$1+OFFSET($B$2,$G31-$B$2,0):OFFSET($B$2,$H31-$B$2,0))>=I$29,OFFSET($C$2,$G31-$B$2,0):OFFSET($O$2,$H31-$B$2,0)))-SUM(IF(($C$1:$O$1+OFFSET($B$2,$G31-$B$2,0):OFFSET($B$2,$H31-$B$2,0))>I$30,OFFSET($C$2,$G31-$B$2,0):OFFSET($O$2,$H31-$B$2,0)))

Excel - Find highest MULTI-cell Total

Hello!
Is it possible to have Excel find the largest two, three or four cell total within its data? For example, dates are entered as 2000, 2001, 2002, etc. in column A and in column B there's another figure, HRs. I have my document sorted by dates (column A) and now want to see the most HRs hit over two seasons. This seems very useful in data analysis but still under-realized. Examples:
-Most touchdowns over a two-season stretch. (most touchdowns over a THREE season stretch etc.)
-Highest-grossing 3-month stretch.
-Rainiest two days ever
-Most speeding tickets issued in two days.
-Largest two-day Jeopardy winnings.
-ETC
Can I search through an Excel document and see the largest 2-day rainfall as described? Something similar to "Find All" in Excel but for consecutive cells, though that doesn't find the largest, I suppose. It'd be cool if you could drag a range, say 3 cells tall, within a larger range, and Excel could find the largest totals in that larger range.
I doubt this is possible---but surely there is a way data scientists or just newspapers can organize data to find the largest total over a certain stretch? (most HR over a 5-season stretch) How could I do this? Maybe this requires a program for SQL or something? Thank you.
https://www.exceltip.com/summing/sum-the-largest-two-numbers-in-range-in-microsoft-excel.html
This seems close, but just finds the two largest figures----not the two largest consecutive figures, which is what I'm looking for.
Using offset ranges:
=MAX(B2:B12+B3:B13)
or subtotal/offset combo:
=MAX(SUBTOTAL(9,OFFSET(B1,ROW(B1:B11),0,2,1)))
(The first one gets cumbersome when extended to 3, 4, 5... adjacent cells.)
Both must be entered as array formulas using Ctrl+Shift+Enter.
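For longer stretches, only the height argument and the ROW range in the SUBTOTAL/OFFSET version need to change. A sketch for three adjacent cells, assuming the data occupies B2:B13 as the formulas above imply:

```excel
=MAX(SUBTOTAL(9,OFFSET(B1,ROW(B1:B10),0,3,1)))
```

ROW(B1:B10) produces the offsets 1 to 10, so the windows run from B2:B4 to B11:B13. In general, for a window of n cells over B2:B13, set the height to n and end the ROW range at row 13-n (B11 for pairs, B10 for triples). Like the formulas above, it is entered with Ctrl+Shift+Enter.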
EDIT
If you wanted to find the first, second etc. largest pair you could use Large instead of Max:
=LARGE(B$2:B$12+B$3:B$13,ROW()-1)
or
=LARGE(SUBTOTAL(9,OFFSET(B$1,ROW(B$1:B$11),0,2,1)),ROW()-1)
and then to find the year, use Index/match:
=INDEX(A$2:A$12,MATCH(F2,SUBTOTAL(9,OFFSET(B$1,ROW(B$1:B$11),0,2,1)),0))
The only drawback of this is that if there were two pairs giving the same maximum of 84 say, the index/match would still match the year of the first one. This can be addressed but maybe that is enough for now.

Excel VBA large data runtime issue

I have large-scale data (700K rows), and I'm trying to count the number of appearances of a word within the rows, and to do so many times (50K iterations).
I'm wondering if Excel is an appropriate platform, using VBA or maybe COUNTIFS, or should I use a different platform?
If so, is there a platform that is similar to Excel and VBA?
Thanks!
With your small sentences in column A and the 700k lines in column A of Sheet1, this formula will count the occurrences. It's an array formula and must be entered with Ctrl+Shift+Enter.
=SUM(--NOT(ISERR(FIND(A2,Sheet1!$A$1:$A$700000))))
To calculate 200 small sentences took about 20 seconds on my machine. If that's an indication, it will take about 1.5 hours to calculate 50k small sentences. You should probably find a better tool or at least hit calculate right before you leave for lunch. Definitely test it on a smaller number to make sure it gives you the answers you want. If you don't have to do this often, maybe 1.5 hours is palatable.
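A non-array alternative worth testing is COUNTIF with wildcards. This is a sketch I have not timed against the FIND version, and it is not an exact equivalent: COUNTIF is case-insensitive (FIND is case-sensitive), it treats any *, ? or ~ in A2 as wildcard syntax, and it does not handle criteria longer than 255 characters.

```excel
=COUNTIF(Sheet1!$A$1:$A$700000,"*"&A2&"*")
```

Like the FIND formula, it counts the rows that contain the text, not the total number of occurrences, and it is entered with a plain Enter.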
