Excel VBA large data runtime issue

I have large-scale data (700K rows), and I'm trying to count the number of appearances of a word within the rows, repeated many times over (50K iterations).
I'm wondering whether Excel is an appropriate platform for this, using VBA or maybe COUNTIFS, or whether I should use a different platform.
If so, is there a platform that has points of similarity with Excel and VBA?
Thanks!

With your small sentences in column A and the 700k lines in column A of Sheet1, this formula will count the occurrences. It's an array formula and must be entered with Ctrl+Shift+Enter.
=SUM(--NOT(ISERR(FIND(A2,Sheet1!$A$1:$A$700000))))
To calculate 200 small sentences took about 20 seconds on my machine. If that's an indication, it will take about 1.5 hours to calculate 50k small sentences. You should probably find a better tool or at least hit calculate right before you leave for lunch. Definitely test it on a smaller number to make sure it gives you the answers you want. If you don't have to do this often, maybe 1.5 hours is palatable.
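If you do move this into VBA, read both columns into arrays once and scan with InStr, rather than touching the worksheet 50k times. A minimal sketch, assuming the phrases live in A2:A50001 of a sheet named Phrases (those names are mine, not from the question):

Sub CountPhraseOccurrences()
    Dim data As Variant, phrases As Variant
    Dim counts() As Long
    Dim i As Long, j As Long
    ' Read both columns into memory once; worksheet access is the slow part
    data = Worksheets("Sheet1").Range("A1:A700000").Value
    phrases = Worksheets("Phrases").Range("A2:A50001").Value ' assumed location of the 50k terms
    ReDim counts(1 To UBound(phrases, 1))
    For i = 1 To UBound(phrases, 1)
        For j = 1 To UBound(data, 1)
            ' InStr > 0 means the term occurs somewhere in the row's text
            If InStr(1, CStr(data(j, 1)), CStr(phrases(i, 1)), vbTextCompare) > 0 Then
                counts(i) = counts(i) + 1
            End If
        Next j
    Next i
    ' Write all results back in a single block
    Worksheets("Phrases").Range("B2").Resize(UBound(counts), 1).Value = Application.Transpose(counts)
End Sub

Note the work is still proportional to rows × phrases, so this may not beat the formula approach by much; benchmark it on a few hundred phrases before committing either way.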

Related

Looking for a better formula in Excel to Calculate Duplicates

I have 621224 × 1 data in my Excel sheet and have to calculate the total number of duplicates in the sheet, but it takes far too much time with =IF(COUNTIF($J$1000:J14353,J5353)>1,1,0). This formula presumably takes O(n²) time to find duplicates, so I am looking for a formula that takes less time, ideally O(n log n), if Excel has one.
For now I am doing this task manually over ranges of 10k rows, which works in acceptable time. I should also add that I have sorted the list.
I looked into VLOOKUP and found it would take around the same time as COUNTIF.
If you've sorted the data then you can use binary searching, which will be many times faster than your current linear set-up with COUNTIF. For example:
=SUMPRODUCT(N(MATCH(A1:A750000,A:A)<>ROW(A1:A750000)))
On my machine, this produced a result in less than 1 second.
If you aren't able to sort the data first, and you have Office 365, you can perform the sorting in-formula:
=LET(ζ,A1:A750000,ξ,SORT(ζ),SUMPRODUCT(N(MATCH(ξ,ξ)<>SEQUENCE(ROWS(ζ)))))
which should still be very fast.
Dropping the IF() may be faster. Try:
=SUM(--(COUNTIFS(A3:A13,A3:A13)>1))
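If VBA is an option, a Scripting.Dictionary counts the duplicates in a single O(n) pass and needs no sorting at all. This is an alternative to the formulas above rather than a variant of them; a minimal sketch:

Sub CountDuplicateRows()
    Dim dict As Object, data As Variant
    Dim i As Long, dupes As Long
    Set dict = CreateObject("Scripting.Dictionary")
    data = Range("A1:A621224").Value ' range size taken from the question
    ' First pass: tally how often each value appears
    For i = 1 To UBound(data, 1)
        dict(data(i, 1)) = dict(data(i, 1)) + 1
    Next i
    ' Second pass: count rows whose value appears more than once
    For i = 1 To UBound(data, 1)
        If dict(data(i, 1)) > 1 Then dupes = dupes + 1
    Next i
    MsgBox dupes & " rows have at least one duplicate"
End Sub

This counts every row that has a duplicate, like the COUNTIFS formula above; change the test to count only second and later occurrences if that is the definition you need.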

Excel - Find highest MULTI-cell Total

Hello!
Is it possible to have Excel find the largest two-, three- or four-cell total within its data? For example, dates are entered as 2000, 2001, 2002, etc. in column A, and in column B there's another figure, HRs. I have my document sorted by dates (column A) and now want to see the most HRs hit over two seasons. This seems very useful in working with data but still under-realized.
- Most touchdowns over a two-season stretch (most touchdowns over a THREE-season stretch, etc.)
- Highest-grossing 3-month stretch.
- Rainiest two days ever.
- Most speeding tickets issued in two days.
- Largest two-day Jeopardy winnings.
- Etc.
Can I search through an Excel document and see the largest 2-day rainfall as described? Something similar to "Find All" in Excel but for consecutive cells, though that doesn't find the largest, I suppose. It'd be cool if you could drag a range, say 3 cells tall, within a larger range, and Excel could find the largest such totals in that larger range.
I doubt this is possible, but surely there is a way data scientists, or just newspapers, can organize data to find the largest total over a certain stretch (most HRs over a 5-season stretch)? How could I do this? Maybe this requires a program like SQL or something? Thank you.
https://www.exceltip.com/summing/sum-the-largest-two-numbers-in-range-in-microsoft-excel.html
This seems close, but it just finds the two largest figures, not the two largest consecutive figures, which is what I'm looking for.
Using offset ranges:
=MAX(B2:B12+B3:B13)
or subtotal/offset combo:
=MAX(SUBTOTAL(9,OFFSET(B1,ROW(B1:B11),0,2,1)))
(the first one gets cumbersome when extended to 3,4,5... adjacent cells)
Both must be entered as array formulas using Ctrl+Shift+Enter.
EDIT
If you wanted to find the first, second, etc. largest pair, you could use LARGE instead of MAX:
=LARGE(B$2:B$12+B$3:B$13,ROW()-1)
or
=LARGE(SUBTOTAL(9,OFFSET(B$1,ROW(B$1:B$11),0,2,1)),ROW()-1)
and then to find the year, use INDEX/MATCH:
=INDEX(A$2:A$12,MATCH(F2,SUBTOTAL(9,OFFSET(B$1,ROW(B$1:B$11),0,2,1)),0))
The only drawback of this is that if there were two pairs giving the same maximum (84, say), the INDEX/MATCH would still match the year of the first one. This can be addressed, but maybe that is enough for now.
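If you need windows wider than two or three cells, the OFFSET construction gets unwieldy and a short UDF can do the same rolling sum in one pass. A sketch, not part of the original answer, with the window length as a parameter:

Function MaxWindowSum(rng As Range, winLen As Long) As Double
    Dim v As Variant, i As Long
    Dim s As Double, best As Double
    v = rng.Value
    ' Seed the sum with the first window
    For i = 1 To winLen
        s = s + v(i, 1)
    Next i
    best = s
    ' Slide the window: add the entering cell, subtract the leaving one
    For i = winLen + 1 To UBound(v, 1)
        s = s + v(i, 1) - v(i - winLen, 1)
        If s > best Then best = s
    Next i
    MaxWindowSum = best
End Function

Entered as =MaxWindowSum(B2:B13,2), it returns the same result as the MAX formulas above, and changing the 2 to any other window length costs nothing.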

Excel - Evaluate multiple cells in a row and create report or display showing lowest to highest

In an Excel 2003 spreadsheet, I have the top row of cells calculating the number of days and hours I have worked on something, based on data I put in the cells below for each category. For example, I enter the time spent on programming, spoken languages, house, piano, guitar, etc. The top cell in each category keeps track of and displays how many days and hours I've spent as I add the time spent on each category each day. I want to evaluate this top row and then list the categories in a "report" (like a pop-up box or another tab or something) in order from the least amount of time to the most. This is so I can see at a glance which category is falling behind and what I need to work on. Can this be done in Excel? VBA? Or do I have to write a program from scratch in C# or Java? Thanks!
VH
Unbelievable... I've been scolded for trying to understand an answer and asked to mark this question as answered. I don't see any way to do that and could not find anything that tells you how, so I'm just writing it here: MY QUESTION WAS ANSWERED. But thanks anyway...
Consider the following screenshot:
The chart data is built with formulas in cells H3 and I3 and copied down. The formulas are:
H3 =INDEX($B$3:$F$3,MATCH(SMALL($B$2:$F$2,ROW(A1)),$B$2:$F$2,0))
I3 =INDEX($B$2:$F$2,MATCH(SMALL($B$2:$F$2,ROW(A1)),$B$2:$F$2,0))
Copy down and build a horizontal bar chart from the data. If you want to change the order of the source data, use LARGE() instead of SMALL().
Alternative Approach
Instead of recording your data in a matrix, consider recording in a flat table with columns for date, category and time spent. That data can then easily be evaluated in many possible ways without using any formulas at all. The screenshot below shows a pivot table and chart where the data is sorted by time spent.
Edit after inspecting file:
Swap rows 2 and 3. Then you can choose one of the approaches outlined above.
Consider entering the study time as time values. It is not immediately clear whether your entry 2.23 means 2 hrs and 23 minutes, or 2 hrs plus 0.23 of an hour, which comes to roughly 2 hrs, 14 minutes.
If you are using the first method, then all your sums involving decimals are off. For example, the total for column B is 7.73 as you sum it. Is that meant to be 7 hrs and 73 minutes? That would really be 8 hrs and 13 minutes, no? Or is it meant to be roughly 7 hrs and 44 minutes? You can see how this is confusing. Use a colon to separate hrs and minutes and, hey, you can see human-readable time values and don't have to convert minute values into decimals.
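If the existing entries are in the hours-dot-minutes style, a one-off helper formula can convert them into true time values. A sketch, assuming A1 holds an entry like 2.23 meaning 2 hrs 23 min:

=TIME(INT(A1),ROUND(MOD(A1,1)*100,0),0)

Format the results, and any sums of them, as [h]:mm so that totals over 24 hours display correctly.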

Optimizing Excel formulas - SUMPRODUCT vs SUMIFS/COUNTIFS

According to a couple of websites, SUMIFS and COUNTIFS are faster than SUMPRODUCT (for example: http://exceluser.com/blog/483/excels-sumifs-or-sumproduct-which-is-faster.html). I have a worksheet with an unknown number of rows (around 200,000), and I'm calculating performance reports from the numbers. I have over 6,000 almost identical SUMPRODUCT formulas with a couple of differences each time (only the conditions change).
Here is an example of what I got:
=IF(AFO4>0,
(SUMPRODUCT((Sheet1!$N:$N=$A4)
*(LEFT(Sheet1!$H:$H,2)="1A")
*(Sheet1!$M:$M<>"service catalog")
*(Sheet1!$J:$J="incident")
*(Sheet1!$I:$I<>"self-serve")
*(Sheet1!$AK:$AK=AFM$1)
*(Sheet1!$E:$E>=$E$1)
*(Sheet1!$E:$E<$E$2))
+SUMPRODUCT((Sheet1!$AJ:$AJ=$C4)
*(LEFT(Sheet1!$H:$H,2)="1A")
*(Sheet1!$M:$M<>"service catalog")
*(Sheet1!$J:$J="incident")
*(Sheet1!$I:$I="self-serve")
*(Sheet1!$AK:$AK=AFM$1)
*(Sheet1!$E:$E>=$E$1)
*(Sheet1!$E:$E<$E$2)))/AFO4,0)
Calculating that one formula takes a little more than 1 second. Since I have more than 6,000 of those formulas, it takes a little over an hour to calculate everything.
So I'm now looking at how I could optimize that formula. Could I convert it to SUMIFS? Would it be faster? All I'm adding up here are 0s and 1s; I'm just counting the number of rows in my data source (Sheet1) where the set of conditions is met. Maybe COUNTIFS would work better?
I would appreciate any help to cut down the execution time, since we need to run these formulas every month.
I can use VBA if that helps, but I've always heard that Excel formulas are usually faster.
Instead of formulas, why not use a PivotTable to crunch the numbers? You potentially face a longer one-time hit to load the data into the PivotCache, but after that you should find a PivotTable recalculates much faster in response to filter changes than these computationally expensive formulas. Is there any reason you're not using one?
Here's some content from a book I'm writing, where I compare SUMPRODUCT, SUMIFS, DSUM, PivotTables, the Advanced Filter, and something called Range Slicing (which uses clever combinations of INDEX/MATCH on sorted data) to conditionally sum the records in a table that contains over 1 million sales records, based on choices you make from 10 different dropdowns:
Those dropdowns allow you to filter the database by a combination of the Store, Segment, Species, Gender, Payment, Cust. History, Order Status, Delivery Instructions, Membership Type, and Order Channel columns. So there’s some pretty mammoth filtering and aggregation going on in order to reduce those 1 million records down to just one sum. The file outlines six different ways to achieve this outcome, the first three of which are shown in the screenshot below:
As you’d expect, when all those dropdowns are set to the same settings, you get exactly the same answer out of all six approaches. But what you won’t expect is just how slow SUMPRODUCT is to calculate a new answer if you change one of those dropdowns, compared to the other approaches.
In fact, it turns out that the SUMIFS approach is 15 times faster than the SUMPRODUCT one at coming up with the answer on this mammoth dataset. But that’s nothing: The range slicing approach is 56 times faster!
The Range Slicing approach works by sorting your source data and then using a series of formulas in helper columns to identify exactly where any records of interest sit within that sorted data. This means you can then directly sum just the few records that match, rather than having to do a complex criteria match against hundreds of thousands of rows (or against a million rows, as in the example here).
Here’s how that looks in terms of my sample file. The number in the Rows helper column on the right-hand side shows that through some clever elimination, the SUM function at the bottom has to process only 18 rows of data (rows 292996 through 293014) rather than all 1 million rows. In other words, this is mighty efficient.
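In formula terms, range slicing boils down to bracketing the block of matching rows with two MATCHes against the sorted key column and summing only that slice. A minimal single-criterion sketch (the criterion in F1, sorted keys in column A, values in column B; these references are mine, not from the book):

=SUM(INDEX(B:B,MATCH($F$1,A:A,0)):INDEX(B:B,MATCH($F$1,A:A)))

The exact MATCH finds the first matching row, the approximate MATCH (a binary search, which is why the sort matters) finds the last, and INDEX:INDEX turns the two positions into the range for SUM. With several criteria you would typically concatenate the keys into one sorted helper column first.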
And here’s the second group of alternatives:
Yup, you can quite easily use a PivotTable here. And the PivotTable approach seems to be around 6 times faster than SUMPRODUCT, although you get a small amount of extra delay when calling up the filters, and the first time you perform a filter operation it takes quite a bit longer again, as Excel must load the PivotCache into memory. But let's face it: setting up the PivotTable in the first place is the easiest of any of these approaches, so it has my vote.
The DSUM approach is 12 times faster than SUMPRODUCT. That's not as good as SUMIFS, but it's still a significant improvement. The Advanced Filter approach is only 4 times faster than SUMPRODUCT, which isn't really surprising, because what it does is grab an extract of all records from the source data that match the criteria in that list, dump it into the spreadsheet, and then sum the result.
The first SUMPRODUCT could become:
=COUNTIFS(Sheet1!$N:$N,$A4,Sheet1!$H:$H,"1A*",Sheet1!$M:$M,"<>service catalog",Sheet1!$J:$J,"incident",Sheet1!$I:$I,"<>self-serve",Sheet1!$AK:$AK,AFM$1,Sheet1!$E:$E,">="&$E$1,Sheet1!$E:$E,"<"&$E$2)
The LEFT part can be handled by a wildcard, as shown. Change the second part along the same lines.
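Derived the same way from the second SUMPRODUCT above (only the $AJ:$AJ=$C4 test changes and "self-serve" becomes a positive match), the second part would be:

=COUNTIFS(Sheet1!$AJ:$AJ,$C4,Sheet1!$H:$H,"1A*",Sheet1!$M:$M,"<>service catalog",Sheet1!$J:$J,"incident",Sheet1!$I:$I,"self-serve",Sheet1!$AK:$AK,AFM$1,Sheet1!$E:$E,">="&$E$1,Sheet1!$E:$E,"<"&$E$2)

Add the two COUNTIFS together and divide by AFO4 inside the original IF, exactly as the SUMPRODUCT version does.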

How can I implement 'balanced' error spreading functionality in Excel?

I have a requirement in Excel to spread small (i.e. penny) monetary rounding errors fairly across the members of my club.
The error arises when I deduct money from members; e.g. £30 divided between 21 members is £1.428571... requiring £1.43 to be deducted from each member, totalling £30.03, in order to hit the £30 target.
The approach that I want to take, continuing the above example, is to deduct £1.42 from each member, totalling £29.82, and then deduct the remaining £0.18 using an error spreading technique to randomly take an extra penny from 18 of the 21 members.
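In formula terms (with the total in E1 and the member count in E2, cells I've chosen purely for illustration), the base deduction and the number of leftover pennies are:

=ROUNDDOWN(E1/E2,2)
=ROUND((E1-E2*ROUNDDOWN(E1/E2,2))*100,0)

which for £30 and 21 members gives £1.42 and 18 pennies, matching the example above.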
This immediately made me think of reservoir sampling, and I used the information here: Random selection,
to construct the test Excel spreadsheet here: https://www.dropbox.com/s/snbkldt6e8qkcco/ErrorSpreading.xls (on Dropbox) for you guys to play with...
The problem I have is that each row of this spreadsheet calculates the error distribution independently of every other row, and this causes some members to contribute more than their fair share of extra pennies.
What I am looking for is a modification to the reservoir sampling technique, or another balanced, two-dimensional error-spreading methodology that I'm not aware of, that will minimise the overall error between members across many 'error spreading' rows.
I think this is one of those challenging problems that has a huge number of other uses, so I'm hoping you geniuses have some good ideas!
Thanks for any insight you can share :)
Will
I found a solution. Not very elegant, though.
You have to use two matrices. In the first you generate completely random numbers with =RAND(), and in the second you pick the n greatest values.
Say that in F30 you have the first
=RAND()
cell.
(I have experimented with your sheet.)
Just copy a column of n (in your sheet, 8) into column A.
In cell F52 you put:
=IF(RANK(F30,$F30:$Z30)<=$A52, 1, 0)
Up to this point, if you drag the formulas left and down, you have the same situation as in your sheet (only less elegant and efficient).
But starting from the second row of random numbers, you can compensate for the pennies already disbursed.
In cell F31 you put:
=RAND()-SUM(F$52:F52)*0.5
(Pay attention to the $ signs; each random number gets a correction based on the pennies already spent.)
If the $ signs are right, you should be OK dragging the formulas left and down. You could also parametrize the 0.5 and experiment with other values. With 0.5 I get an error factor (the equivalent of your cell AB24) between 1 and 2.
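A more deterministic variant of the same compensation idea, sketched in VBA: hand each round's extra pennies to whichever members currently carry the fewest, breaking ties at random. This greedy balancing is an alternative to the formula approach above, and all the numbers in it are illustrative:

Sub SpreadPennies()
    Const MEMBERS As Long = 21
    Dim extras As Variant
    Dim carried(1 To MEMBERS) As Long ' running extra-penny total per member
    Dim r As Long, k As Long, i As Long, pick As Long
    extras = Array(18, 5, 12) ' pennies to hand out in each round (example values only)
    Randomize
    For r = LBound(extras) To UBound(extras)
        For k = 1 To extras(r)
            ' Give the penny to a member with the lowest running total;
            ' the Rnd comparison breaks ties roughly at random
            pick = 1
            For i = 2 To MEMBERS
                If carried(i) < carried(pick) Or _
                   (carried(i) = carried(pick) And Rnd < 0.5) Then pick = i
            Next i
            carried(pick) = carried(pick) + 1
        Next k
    Next r
    ' Because every penny always goes to a current minimum, no two members
    ' ever differ by more than one penny in total
End Sub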
