Looking for a better formula in Excel to Calculate Duplicates

I have a single column of 621,224 values in my Excel sheet and need to count the total number of duplicates. It takes far too long with =IF(COUNTIF($J$1000:J14353,J5353)>1,1,0), presumably because this formula has O(n^2) complexity. I am looking for a formula that takes less time, ideally O(n log n), if Excel has one.
For now I am doing this task manually in ranges of 10k rows, which runs in acceptable time. I have also sorted the list.
I looked into VLOOKUP and found it would take about the same time as COUNTIF.

If you've sorted the data, you can use binary search, which will be many times faster than your current linear set-up with COUNTIF. For example:
=SUMPRODUCT(N(MATCH(A1:A750000,A:A)<>ROW(A1:A750000)))
On my machine, this produced a result in less than 1 second.
If you can't sort the data first and you have Office 365, you can perform the sorting in-formula:
=LET(ζ,A1:A750000,ξ,SORT(ζ),SUMPRODUCT(N(MATCH(ξ,ξ)<>SEQUENCE(ROWS(ζ)))))
which should still be very fast.
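To see why the sorted-data approach is fast, here is a minimal Python sketch of the same idea. It is an illustration, not the answerer's code: it assumes the values are already sorted, uses the standard bisect module in place of MATCH's binary search, and count_extra_copies is a hypothetical name. Like the formula appears to do, it counts every row that is not the last occurrence of its value, i.e. total rows minus distinct values.

```python
from bisect import bisect_right


def count_extra_copies(sorted_values):
    """Count rows that are not the last occurrence of their value.

    Assumes sorted_values is sorted ascending. Each lookup is a binary
    search, so the whole pass is O(n log n) rather than O(n^2).
    """
    extra = 0
    for index, value in enumerate(sorted_values):
        # 1-based position of the last row holding this value
        last_position = bisect_right(sorted_values, value)
        if last_position != index + 1:
            extra += 1
    return extra


sample = sorted([3, 1, 2, 3, 3, 2])  # [1, 2, 2, 3, 3, 3]
print(count_extra_copies(sample))
```

Note that this differs from the question's COUNTIF formula, which flags every row belonging to a duplicated value, including the last one.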

Dropping the IF() may be faster. Try:
=SUM(--(COUNTIFS(A3:A13,A3:A13)>1))
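What this formula computes can be sketched in Python with a hash-based tally, which needs a single pass over the data instead of comparing every row with every other row. This is an illustration under assumptions, not part of the answer: rows_with_duplicates is a hypothetical helper, and it counts every row whose value occurs more than once.

```python
from collections import Counter


def rows_with_duplicates(values):
    """Count rows whose value appears more than once in the column."""
    counts = Counter(values)  # one pass to tally each value
    return sum(1 for value in values if counts[value] > 1)


print(rows_with_duplicates([3, 1, 2, 3, 3, 2]))
```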

Related

Subtracting times across a day in excel

I am working on the capstone project of the Google Career Certificate in Data Analytics, using Microsoft Excel. I have to calculate the ride length from the start and end ride times. I've entered the formula =F2(end time)-D2(start time), which returns the ride length. Going through my list, I found rows where the start time is around 11pm and the end time is 1am; these return ###### because the regular formula produces a negative number. I've found a modified formula that roughly does the conversion I am looking for, but it is still problematic. The modified formula is =(F2-D2+(F2<D2))*24, and it seems to give an accurate ride length if I reformat the answer as a number. The issue is that the rest of my data is in time format and the modified values are in number format. If I convert the number values to time, the ride length values are inaccurate.
It is tricky to make the numeric value change as well due to me using a formula. I can correct them one by one after I save Excel and it no longer stores the numbers as the formula, but there are lots of data points to change and that would be time consuming. I'm hoping to find a more concise way to solve this problem. Maybe with a better formula.
[Image: snippet of the chart]

Just like everything in life, there are multiple ways to achieve things. I would have put the date and time into a single cell, but if you're gathering the data from another source, that's understandable.
A simple IF statement here will work. IF the days are one apart, then take '1' day off the starting time, else do your original formula:
=IF(E4-C4=1,F4-(D4-1),F4-D4)
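The wrap-around logic in this thread can be sketched in Python as a modulo over one day. This is a hedged illustration, not the answerer's method: ride_length is a hypothetical name, times of day are modelled as timedelta offsets from midnight, and it assumes no ride lasts 24 hours or longer.

```python
from datetime import timedelta

ONE_DAY = timedelta(days=1)


def ride_length(start, end):
    """Duration from start to end, where both are offsets from midnight.

    If end is earlier than start, the ride is assumed to have crossed
    midnight once, so the negative difference wraps around by one day.
    """
    return (end - start) % ONE_DAY


# 11pm to 1am crosses midnight
print(ride_length(timedelta(hours=23), timedelta(hours=1)))
```

The result stays a duration rather than a number of hours, which mirrors keeping the Excel result in time format.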

Excel - Find highest MULTI-cell Total

Hello!
Is it possible to have Excel find the largest two, three or four cell total within its data? For example, dates are entered as 2000, 2001, 2002, etc in column A and in column B, there's another figure, HRs. I have my document sorted by dates (columnA) and now want to see the most HRs hit over two seasons. This seems very useful and utilized in data but still under-realized.
-Most touchdowns over a two-season stretch. (most touchdowns over a THREE season stretch etc.)
-Highest-grossing 3-month stretch.
-Rainiest two days ever
-Most speeding tickets issued in two days.
-Largest two-day Jeopardy winnings.
-ETC
Can I search through an Excel document and see the largest 2-day rainfall as described? Something similar to "Find All" in Excel but for consecutive cells, though that doesn't find the largest, I suppose. It'd be cool if you could drag a range, say 3 cells tall, within a larger range, and Excel could find the largest total in that larger range.
I doubt this is possible---but surely there is a way data scientists or just newspapers can organize data to find the largest total over a certain stretch? (most HR over a 5-season stretch) How could I do this? Maybe this requires a program for SQL or something? Thank you.
https://www.exceltip.com/summing/sum-the-largest-two-numbers-in-range-in-microsoft-excel.html
This seems close, but just finds the two largest figures----not the two largest consecutive figures, which is what I'm looking for.

Using offset ranges:
=MAX(B2:B12+B3:B13)
or subtotal/offset combo:
=MAX(SUBTOTAL(9,OFFSET(B1,ROW(B1:B11),0,2,1)))
(The first one gets cumbersome when extended to 3, 4, 5... adjacent cells.)
Both must be entered as array formulas using Ctrl+Shift+Enter.
EDIT
If you wanted to find the first, second etc. largest pair you could use Large instead of Max:
=LARGE(B$2:B$12+B$3:B$13,ROW()-1)
or
=LARGE(SUBTOTAL(9,OFFSET(B$1,ROW(B$1:B$11),0,2,1)),ROW()-1)
and then to find the year, use Index/match:
=INDEX(A$2:A$12,MATCH(F2,SUBTOTAL(9,OFFSET(B$1,ROW(B$1:B$11),0,2,1)),0))
The only drawback of this is that if there were two pairs giving the same maximum of 84 say, the index/match would still match the year of the first one. This can be addressed but maybe that is enough for now.
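The same "largest total over k consecutive cells" search can be sketched in Python as a sliding window. This is an illustration only: best_window is a hypothetical name and the sample figures are made up. On ties it keeps the first window, which matches the first-match behaviour described above.

```python
def best_window(values, k):
    """Return (largest sum of k consecutive values, 0-based start index)."""
    if k <= 0 or k > len(values):
        raise ValueError("k must be between 1 and len(values)")
    window = sum(values[:k])
    best_sum, best_start = window, 0
    for end in range(k, len(values)):
        # slide the window one cell: add the new value, drop the oldest
        window += values[end] - values[end - k]
        if window > best_sum:  # strict '>' keeps the first of tied windows
            best_sum, best_start = window, end - k + 1
    return best_sum, best_start


home_runs = [10, 40, 44, 5, 30]  # made-up season totals
print(best_window(home_runs, 2))
```

Changing k to 3, 4 or 5 needs no other change, unlike the first formula, which has to be extended by hand.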

Using MIN IF function to find peaks and valleys in data

I have a cyclic data set. There are low cycles and high cycles, each with slightly different mins and maxes. I need to find the values of each min and max. I have attached a picture of a simplified version of what I have. I know roughly what time each peak/valley will occur, so I thought I could use the MIN IF function to isolate each extreme value. For example, if I wanted to find a valley between time 1 and time 5, I would use this formula:
=MIN(IF(1< time<5,data))
This just yields 0 for some reason. It sort of worked once, but instead of isolating the minimum for the selected time period, it just found the minimum for the whole column. What am I doing wrong here? Is what I am trying to do possible without using VBA? This is a template for work that others will use, and not everyone is able to use macro-enabled workbooks, so I'd like to avoid that.

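Excel does not treat a chained comparison such as 1<time<5 as a range test, so test each bound in its own IF. Use this: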
=MIN(IF(time>5,IF(time<12,data)))
It is an array formula and needs to be confirmed with Ctrl-Shift-Enter
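The filtering that the nested IFs perform can be sketched in Python as follows. This is an illustration under assumptions: min_between is a hypothetical name, the sample readings are made up, and both bounds are exclusive as in the formula.

```python
def min_between(times, data, low, high):
    """Smallest data value whose time lies strictly between low and high."""
    selected = [value for t, value in zip(times, data) if low < t < high]
    if not selected:
        raise ValueError("no samples fall inside the time window")
    return min(selected)


times = [1, 2, 3, 4, 5, 6, 7, 8]
data = [5, 3, 1, 4, 6, 2, 7, 0]
print(min_between(times, data, 1, 5))
```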

Optimizing Excel formulas - SUMPRODUCT vs SUMIFS/COUNTIFS

According to a couple of web sites, SUMIFS and COUNTIFS are faster than SUMPRODUCT (for example: http://exceluser.com/blog/483/excels-sumifs-or-sumproduct-which-is-faster.html). I have a worksheet with an unknown number of rows (around 200,000) and I'm calculating performance reports from the numbers. I have over 6000 almost identical SUMPRODUCT formulas, with a couple of differences each time (only the conditions change).
Here is an example of what I got:
=IF(AFO4>0,
(SUMPRODUCT((Sheet1!$N:$N=$A4)
*(LEFT(Sheet1!$H:$H,2)="1A")
*(Sheet1!$M:$M<>"service catalog")
*(Sheet1!$J:$J="incident")
*(Sheet1!$I:$I<>"self-serve")
*(Sheet1!$AK:$AK=AFM$1)
*(Sheet1!$E:$E>=$E$1)
*(Sheet1!$E:$E<$E$2))
+SUMPRODUCT((Sheet1!$AJ:$AJ=$C4)
*(LEFT(Sheet1!$H:$H,2)="1A")
*(Sheet1!$M:$M<>"service catalog")
*(Sheet1!$J:$J="incident")
*(Sheet1!$I:$I="self-serve")
*(Sheet1!$AK:$AK=AFM$1)
*(Sheet1!$E:$E>=$E$1)
*(Sheet1!$E:$E<$E$2)))/AFO4,0)
Calculating that thing takes a little bit more than 1 second. Since I have more than 6000 of those formulas, it takes a little bit over an hour to calculate everything.
So, I'm now looking at how I could optimize that formula. Could I convert it to SUMIFS? Would it be faster? All I'm adding up here is 0s and 1s, I'm just counting the number of rows in my data source (Sheet1) where the set of conditions is met. Maybe COUNTIFS would work better?
I would appreciate any help to gain some execution time since we need to execute the formulas every month.
I can use VBA if that helps, but I always heard that Excel formulas were usually faster.

Instead of formulas, why not use a PivotTable to crunch the numbers? You potentially face a longer one-time hit to load the data into the PivotCache, but after that, you should find a PivotTable recalculates much faster in response to filter changes than these computationally expensive formulas. Is there any reason you're not using one?
Here's some content from a book I'm writing, where I compare SUMPRODUCT, SUMIFS, DSUM, PivotTables, the Advanced Filter, and something called Range Slicing (which uses clever combinations of INDEX/MATCH on sorted data) to conditionally sum the records in a table that contains over 1 million sales records, based on choices you make from 10 different dropdowns:
Those dropdowns allow you to filter the database by a combination of the Store, Segment, Species, Gender, Payment, Cust. History, Order Status, Delivery Instructions, Membership Type, and Order Channel columns. So there’s some pretty mammoth filtering and aggregation going on in order to reduce those 1 million records down to just one sum. The file outlines six different ways to achieve this outcome, the first three of which are shown in the screenshot below:
As you’d expect, when all those dropdowns are set to the same settings, you get exactly the same answer out of all six approaches. But what you won’t expect is just how slow SUMPRODUCT is to calculate a new answer if you change one of those dropdowns, compared to the other approaches.
In fact, it turns out that the SUMIFS approach is 15 times faster than the SUMPRODUCT one at coming up with the answer on this mammoth dataset. But that’s nothing: The range slicing approach is 56 times faster!
The Range Slicing approach works by sorting your source data, and then using a series of formulas in helper columns to identify exactly where any records of interest sit within that sorted data. This means that you can then directly sum just the few records that match rather than having to do a complex criteria match against hundreds of thousands of rows (or against a million rows, as in the example here).
Here’s how that looks in terms of my sample file. The number in the Rows helper column on the right-hand side shows that through some clever elimination, the SUM function at the bottom has to process only 18 rows of data (rows 292996 through 293014) rather than all 1 million rows. In other words, this is mighty efficient.
And here’s the second group of alternatives:
Yup, you can quite easily use a PivotTable here. And the PivotTable approach seems to be around 6 times faster than SUMPRODUCT—although you get a small amount of extra delay when calling up the filters, and the first time you perform a filter operation it takes quite a bit longer again, as Excel must load the PivotCache into memory. But let’s face it: Setting up the PivotTable in the first place is the easiest of any of these approaches, so it has my vote.
The DSUM approach is 12 times faster than SUMPRODUCT. That’s not as good as SUMIFS, but it’s still a significant improvement. The Advanced Filter approach is only 4 times faster than SUMPRODUCT—which isn’t really surprising because what it does is grab an extract of all records from the source data that match the criteria in that list, dump it into the spreadsheet, and then sum the result.
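The Range Slicing idea described above (sort once, locate the block of matching rows, then sum only that block) can be sketched in Python. This is a simplified illustration with a single sort key, not the book's helper-column formulas; sum_for_key and the sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def sum_for_key(sorted_keys, amounts, key):
    """Sum the amounts whose key matches, given rows sorted by key.

    Two binary searches find the first and one-past-last matching row,
    so only the matching slice is summed instead of scanning every row.
    Returns (total, number of rows summed).
    """
    first = bisect_left(sorted_keys, key)
    past_last = bisect_right(sorted_keys, key)
    return sum(amounts[first:past_last]), past_last - first


stores = ["A", "A", "B", "B", "B", "C"]  # already sorted by store
sales = [1, 2, 3, 4, 5, 6]
print(sum_for_key(stores, sales, "B"))
```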

1st SUMPRODUCT could become
=COUNTIFS(Sheet1!$N:$N,$A4,Sheet1!$H:$H,"1A*",Sheet1!$M:$M,"<>service catalog",Sheet1!$J:$J,"incident",Sheet1!$I:$I,"<>self-serve",Sheet1!$AK:$AK,AFM$1,Sheet1!$E:$E,">="&$E$1,Sheet1!$E:$E,"<"&$E$2)
The LEFT part can be handled by a wildcard, as shown.
Change the second SUMPRODUCT along the same lines.

Excel VBA large data runtime issue

I have large-scale data (700K rows), and I'm trying to count the number of appearances of a word within the rows, and to do so many times (50K iterations).
I'm wondering whether Excel is an appropriate platform, using VBA or maybe COUNTIFS, or whether I should use a different platform.
If so, is there a platform that is similar to Excel and VBA?
Thanks!

With your small sentences in column A and the 700k lines in column A of Sheet1, this formula will count the occurrences. It's an array formula and must be entered with Ctrl+Shift+Enter.
=SUM(--NOT(ISERR(FIND(A2,Sheet1!$A$1:$A$700000))))
To calculate 200 small sentences took about 20 seconds on my machine. If that's an indication, it will take about 1.5 hours to calculate 50k small sentences. You should probably find a better tool or at least hit calculate right before you leave for lunch. Definitely test it on a smaller number to make sure it gives you the answers you want. If you don't have to do this often, maybe 1.5 hours is palatable.
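For comparison outside Excel, the same count can be sketched in Python. This is an illustration, not a benchmark: count_rows_containing is a hypothetical name and the sample rows are made up. Like the FIND-based formula, it is case-sensitive and counts rows that contain the phrase, not the total number of occurrences.

```python
def count_rows_containing(rows, phrase):
    """Count the rows in which phrase appears at least once (case-sensitive)."""
    return sum(1 for row in rows if phrase in row)


rows = ["the cat sat", "a dog", "cat and cat", "Cat"]
print(count_rows_containing(rows, "cat"))
```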
