RemoveDuplicates is not working as expected - excel

I have a large data set that is exported from a website. I use a macro in my main ‘filter’ workbook to find the file and copy the data from Sheet1 of the exported file into Sheet1 of the filter workbook.
Once the data is copied into Sheet1 of the filter workbook, I use VBA to copy columns A/B/D/F/H/Z/AA/etc from Sheet 1 of the filter workbook into Sheet2 of the filter workbook AND also at the same time, I use this code here to attempt to delete any duplicate rows:
Worksheets("Sheet2").Range("A:DZ").RemoveDuplicates Columns:=15, Header:=xlYes
I am finding though that the RemoveDuplicates is not working as expected.
As an example, Sheet1 in the filter workbook (and the export workbook) has 3344 rows. When I manually use conditional formatting to highlight duplicates, I find 314 rows listed as duplicates (meaning 157 genuine rows of actual data and 157 rows that duplicate them; I haven't found any value duplicated more than once). So on Sheet2 I was expecting to see 3344 - 157 = 3187 rows of real data. I don't see that, or even 3030 rows (3344 - 314). Instead, I am getting 1897 rows pasted into Sheet2, which is a difference of 1447 rows (1290 fewer rows than expected).
On top of that, I am manually checking the data using Ctrl+F in the column, and I am finding that in some instances both of the two duplicated items are missing from Sheet2 (versus it just deleting the one duplicate row).
The duplicate data is not on sequential rows and is scattered throughout the column in Sheet2. But sorting it before I attempt RemoveDuplicates does not seem to affect its accuracy or make it right.
I have also tried using RemoveDuplicates at different locations in the code / at different times, but the results are always off by the same amount (1447 missing rows instead of 157 missing rows).
I found only a few articles on this site; this one was the closest but not quite my issue: Delete Rows With Duplicate Data VBA
And other help sites/forums mention there was some bug with office 2007 that prevents this from working (am using 2013).
Does anyone know where I can find a work around, or if one exists - or if this is still a real bug or just a problem with the code line I have above.
Adding bits of code I use in this example in case it is something within these that is causing the problem…
Part of the copy code:
wsFromSheet.Cells.Copy
wsToFile.Range("A1").PasteSpecial xlPasteValuesAndNumberFormats
Application.CutCopyMode = False
wbFromFile.Close True
Part of the ‘paste column code’:
Sheets("Sheet2").Rows(1 & ":" & Sheets("Sheet2").Columns.Count).ClearContents 'Clear from row 1 down
LastRowFromSiteTracker = xSht.Cells.SpecialCells(xlCellTypeLastCell).Row 'original report has 128 columns
xSht.Range("B1:B" & LastRowFromSiteTracker).Copy ySht.Cells(Rows.Count, "A").End(xlUp) 'customer name
'repeat a bunch of times, then…
Application.CutCopyMode = False 'do I need this line?
Worksheets("Sheet2").Range("A:DZ").RemoveDuplicates Columns:=15, Header:=xlYes
End Sub
Example/sample of data:
Row  Source Data  Expected Data  Actual Data
1    1000474608   1000474608     1000474608   (Dup missing from sheet2)
2    1000474608   1000487672     1000487672
3    1000487672   1000487674     1000487674
4    1000487674   1000487676     1000487676   (missing from sheet2, wasn't a dup)
5    1000487676   1000487678     1000487678
6    1000487678   1000487680     1000487680
7    1000487680   1000487682     1000487682   (Dup missing from sheet2)
8    1000487682   1000520278     1000520278
9    1000487682   1000520280     1000520280
10   1000520278   1000520282     1000520282   (Is there)
11   1000520280   1000520286     1000520286
12   1000520282   1000520336     1000520336   (Is there)
13   1000520282   1000520338     1000520338
14   1000520286   1000520392     1000520392
15   1000520286   1000520394     1000520394
16   1000520336   1000530333     1000530333
17   1000520338
18   1000520392
19   1000520394
20   1000530333
EDIT:
So I've tried to do some more manual testing, and tried two separate things with the same set of data, getting two different results. I used the Conditional Formatting - Highlight Duplicates from the main Home ribbon and the Remove Duplicates from the Data ribbon.
The Remove Duplicates button finds and removes 163 items in Column P and leaves 3181 rows.
However, the Highlight Duplicates conditional format finds 314 items that are duplicated within Column P, leaving 3030 non-duplicates.
It does not make sense that these two numbers do not match. I thought it had something to do with the duplicates themselves, as most duplicated items have only one duplicate (123123 shows up in two rows) but a small handful are duplicated multiple times (234234 shows up in 4 or 6 rows).
So instead of using the manual way, I used the suggestions I've found online, and both of these also provide differing results when run:
3344 Base records
1897 left after scrub of duplicates (1446 removed)
Dim tmpAddress As String
tmpAddress = "A2:BZ" & Worksheets("ColScrub").UsedRange.Rows.Count
Worksheets("ColScrub").Range(tmpAddress).RemoveDuplicates Columns:=15, Header:=xlNo
3181 left after scrub of duplicates (162 removed)
Cells.Select
ActiveSheet.Range("$A$1:$EI$3345").RemoveDuplicates Columns:=31, Header:=xlYes

My further experience now shows that UsedRange is unreliable for finding the last row or column if you have blank rows or columns: UsedRange.Rows.Count counts the rows inside the used range rather than giving the last row number, so it comes up short whenever the used range does not start at row 1. I have found a better way to get the last of each. These functions use 2 basic assumptions, which should hold true for most of your spreadsheets.
For LastRow there is a "key" column, i.e. a column where there MUST be data, for example an ID column
For LastCol there should be a header row (or row where you can guarantee the last column is filled)
With this in mind, I have created the following 2 functions to retrieve the last values accurately, every time ... well, almost (my complete function handles issues of the footer rows with merged cells)
For the last row
Public Function Excel_GetLastRow(xlSheet As Excel.Worksheet, _
ByVal KeyColumn As Long) As Long
' This could be adjusted for exact max rows Excel allows
Const MAX_XL_ROWS As Long = 1048000
Excel_GetLastRow = xlSheet.Cells(MAX_XL_ROWS, KeyColumn).End(xlUp).row
End Function
And for last column
Public Function Excel_GetLastCol(xlSheet As Excel.Worksheet, _
ByVal HeaderRow As Long) As Long
' This could be adjusted for exact max columns Excel allows
Const MAX_XL_COLS As Long = 16000
' Cells takes (row, column): start at the far right of the header row and move left
Excel_GetLastCol = xlSheet.Cells(HeaderRow, MAX_XL_COLS).End(xlToLeft).Column
End Function
Using these values you can now set your complete data range successfully.
top left = Cells(HeaderRow + 1, 1)
bottom right = Cells(LastRow, LastCol)
My complete functions include error handling and allowances for possible merged cells in the footer row and last header column, but you get the idea.
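As a rough sketch of how those two values fit together with RemoveDuplicates (this is not the poster's complete function; the sheet name, header row and key column below are assumptions borrowed from the question), the last row and last column can be computed inline and used to build the range:

```vba
Sub RemoveDuplicatesInDataRange()
    ' Assumptions for illustration: data lives on "Sheet2", headers are in
    ' row 1, and column P (16) is the key column that always holds a value.
    Const HEADER_ROW As Long = 1
    Const KEY_COLUMN As Long = 16

    Dim ws As Worksheet
    Dim dataRange As Range
    Dim lastRow As Long, lastCol As Long

    Set ws = ThisWorkbook.Worksheets("Sheet2")

    ' Last row: start at the bottom of the key column and move up.
    lastRow = ws.Cells(ws.Rows.Count, KEY_COLUMN).End(xlUp).Row
    ' Last column: start at the far right of the header row and move left.
    lastCol = ws.Cells(HEADER_ROW, ws.Columns.Count).End(xlToLeft).Column

    ' Nothing to do without data rows, or if the key column is outside the range.
    If lastRow <= HEADER_ROW Or lastCol < KEY_COLUMN Then Exit Sub

    Set dataRange = ws.Range(ws.Cells(HEADER_ROW, 1), ws.Cells(lastRow, lastCol))

    ' Columns:= is counted from the first column of the range, so with a
    ' range that starts in column A, 16 means column P.
    dataRange.RemoveDuplicates Columns:=KEY_COLUMN, Header:=xlYes
End Sub
```

Using ws.Rows.Count and ws.Columns.Count instead of hard-coded limits keeps the sketch independent of the Excel version.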
Art

1) You are only clearing as many rows as you have columns, not rows
Also, you may not be clearing anything, so use the UsedRange.Rows for proper rowcount
This line ...
Sheets("Sheet2").Rows(1 & ":" & Sheets("Sheet2").Columns.Count).ClearContents
Should read ...
Sheets("Sheet2").Rows(1 & ":" & Sheets("Sheet2").UsedRange.Rows.Count).ClearContents
Without properly clearing the old data, unpredictable results may occur.
2) Excel VBA seems rather quirky in that many things won't work correctly without specifically "selecting" the object(s) in question AND specifying complete (not columnar) ranges
3) I also prefer to leave out the header row (note the "A2") and pass Header:=xlNo
4) Unless you have more than 78 columns, BZ should be far enough
So add this to your code ...
Dim tmpAddress As String
tmpAddress = "A2:BZ" & Worksheets("Sheet2").UsedRange.Rows.Count
Worksheets("Sheet2").Activate
Worksheets("Sheet2").Range(tmpAddress).RemoveDuplicates Columns:=15, Header:=xlNo
Hope this helps :)

I don't know why (or if/how) this is any different, but this seems to be the closest I can get to true removal of duplicates. I wanted to add it here as an answer for others in similar situations.
Dim lastrow As Long
With ThisWorkbook.Worksheets("ColScrub")
If Application.WorksheetFunction.CountA(.Cells) <> 0 Then
lastrow = .Cells.Find(What:="*", _
After:=.Range("A1"), _
Lookat:=xlPart, _
LookIn:=xlFormulas, _
SearchOrder:=xlByRows, _
SearchDirection:=xlPrevious, _
MatchCase:=False).Row 'Change P1 back to A1 if needed
Else
lastrow = 1
End If
.Range("A1:AZ" & lastrow).RemoveDuplicates Columns:=Array(16), Header:=xlYes
End With
I think I have to go through each row visually to prove this works, and to rule out that it is deleting things that should not be deleted, but this seems to get rid of the 'double duplicates' (items that appear on 4-6 rows instead of 2 like the other duplicates).
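Two details may account for the differing counts in this thread. First, Columns:= is an index into the range passed to RemoveDuplicates, so in a range starting at column A, Columns:=15 compares column O while Array(16) compares column P. Second, Highlight Duplicates marks every occurrence of a repeated value, whereas Remove Duplicates keeps the first occurrence and deletes the rest; with the numbers from the question, 314 highlighted rows and 163 removed rows would mean 314 - 163 = 151 distinct repeated values. The following is only a sketch for checking that on your own data. It assumes the key values are in column P of "ColScrub" with a header in row 1, that the column contains no error values, and that the Windows-only Scripting.Dictionary object is available.

```vba
Sub CompareDuplicateCounts()
    Dim ws As Worksheet
    Dim data As Variant, key As Variant
    Dim counts As Object
    Dim i As Long, lastRow As Long
    Dim highlighted As Long, removable As Long, repeatedValues As Long

    Set ws = ThisWorkbook.Worksheets("ColScrub")
    lastRow = ws.Cells(ws.Rows.Count, "P").End(xlUp).Row
    If lastRow < 3 Then Exit Sub            ' need at least two data rows

    data = ws.Range("P2:P" & lastRow).Value  ' 2-D array, one column

    Set counts = CreateObject("Scripting.Dictionary")
    counts.CompareMode = vbTextCompare       ' ignore case, like Excel's comparison

    ' Count how many times each value occurs.
    For i = 1 To UBound(data, 1)
        key = CStr(data(i, 1))
        counts(key) = counts(key) + 1
    Next i

    For Each key In counts.Keys
        If counts(key) > 1 Then
            repeatedValues = repeatedValues + 1
            highlighted = highlighted + counts(key)      ' every occurrence
            removable = removable + counts(key) - 1      ' all but the first
        End If
    Next key

    Debug.Print "Distinct repeated values: " & repeatedValues
    Debug.Print "Rows Highlight Duplicates would mark: " & highlighted
    Debug.Print "Rows Remove Duplicates would delete: " & removable
End Sub
```

The counts may differ slightly from Excel's own if the column holds blank cells or values that only look alike after formatting, so treat the output as a cross-check rather than an exact prediction.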

Related

VBA Range.End(xlDown) stops at last visible row

I am doing a simple VBA script in Microsoft Excel which iterates a list of cars and a list of information about when the cars were refueled to provide an overview of how many kilometers each car is driving each month.
I make use of the Range.End property to calculate the number of rows with data and then loop through the indices.
Set Data = Worksheets("Tankninger") ' Danish for refuellings
NumRows = Data.Range("A1", Data.Range("A1").End(xlDown)).Rows.Count
For x = 1 To NumRows
' Process data
Next
Everything seemed to be working fine, however I found that if someone applied a filter to e.g. the sheet with refuelling data - e.g. only showing data related to car A, then NumRows would be assigned the index of the last visible row.
Example: if the refuelling sheet contains 100 records and the records related to car A are located on rows 50-60, then NumRows would be assigned the value 60, resulting in my script ignoring the last 40 records.
Is there a way to make the Range.End property ignore any filter applied to sheet, or will I have to change the implementation to use a while-loop instead?
I ended up replacing the for-loop with a while-loop. This allowed me to access every cell regardless of any filtering applied to the sheets.
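Set Data = Worksheets("Tankninger") ' Danish for refuellings
r = 2 ' first data row, below the header
While Not IsEmpty(Data.Cells(r, 1).Value)
' Process data
r = r + 1 ' advance to the next row; without this the loop never ends
Wend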
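What you can do is add the following to your code to remove filters before you find the last row with data.
'Remove all filters
Worksheets("Sheet1").Activate
On Error Resume Next 'ShowAllData raises an error when no filter is active
ActiveSheet.ShowAllData
On Error GoTo 0 'restore normal error handling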
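A variation on the same idea, offered only as a sketch: check the worksheet's FilterMode property first, so ShowAllData is called only when a filter is actually hiding rows and no error suppression is needed. The function name LastRowIgnoringFilter is made up for this example, and the sheet name comes from the question. Note that this clears the user's filter as a side effect, and data held in a table (ListObject) may need different handling.

```vba
Function LastRowIgnoringFilter(ws As Worksheet, ByVal keyColumn As Long) As Long
    ' FilterMode is True while a filter is hiding rows on the sheet;
    ' calling ShowAllData only in that case avoids the run-time error.
    If ws.FilterMode Then ws.ShowAllData
    LastRowIgnoringFilter = ws.Cells(ws.Rows.Count, keyColumn).End(xlUp).Row
End Function

Sub ProcessRefuellings()
    Dim data As Worksheet
    Dim numRows As Long, x As Long

    Set data = Worksheets("Tankninger")      ' Danish for refuellings
    numRows = LastRowIgnoringFilter(data, 1) ' column A is assumed to be filled

    For x = 1 To numRows
        ' Process data
    Next x
End Sub
```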

Add and remove cells based on values

Original Post: Here (New post as it started to get very clunky and dragged away from the issues at hand)
I am looking to automatically grab data from an excel CRM output and take certain values into a new sheet. I have had a bit of luck with my progress, but I am still struggling to adapt the code properly.
First Iteration of Code:
Sub Client_CRM()
Range("A4:A44,C4:C44,G4:H44").Select
Selection.Copy
Sheets("Output Sheet").Select
Range("A1").Select
ActiveSheet.Paste
End Sub
Current code:
Sub Client_CRM()
Dim ClientStartRow As Long, ClientEndRow As Long
Dim PortfolioRange As Range, Listed As Range 'both are assigned with Set below, so they must be Ranges
Set PortfolioRange = Worksheets("Client Paste").Range("A:M")
Set Listed = Worksheets("Client Paste").Range("A:A")
With Sheets("Client Paste")
Sheets("Output Sheet").Cells.Clear
Worksheets("Client Paste").Activate
ClientStartRow = .Range("A3").Row
':A").Find(What:="Listed", after:=.Range("A1")).Row
ClientEndRow = .Range("A:A").Find(What:="Totals", after:=.Range("A3"), LookIn:=xlValues, lookat:=xlPart, MatchCase:=False).Row
Range(ClientStartRow & ":" & ClientEndRow).Select
Selection.Columns(1).Copy
Sheets("Output Sheet").Select
Range("A3").Select
ActiveSheet.Paste
Sheets("Output Sheet").Range("B1:B70") = Application.VLookup(Listed, PortfolioRange, 8, False)
Sheets("Output Sheet").Range("C1:C70") = Application.VLookup(Listed, PortfolioRange, 3, False)
Sheets("Output Sheet").Range("D1:D70") = Application.VLookup(Listed, PortfolioRange, 7, False)
End With
End Sub
As you can see, I've slowly added and learnt more things throughout today.
What I am now looking to do is:
Find a better way to copy the columns over to the new sheet. An issue that I have encountered is that maybe 1 in 10 CRM exports has an additional column, so the VLOOKUP can't be relied on 100% of the time. The CRM export has headers, on Row 2. Can I use some sort of code to grab these columns by header value? "Listed", "Quantity", "MV" and "PW" are the 4 headings. Usually they are columns 1, 3, 7, 8, but in rare instances they are 1, 3, 8, 9...
Find a way to remove certain "blacklist" products. All products generally have a 3 part code that they are identified as. There are certain 3 part codes I do not want included and I want to be able to update this as time goes on. Ideally, I'd like to make a separate sheet with these codes and if they match to anything from the export, they aren't copied over...
Some product codes have 5 characters instead of 3. I'd like these ones to be copied in the same list but also added to a separate list (unsure if this is possible?)
Update:
Have worked out how to get the code to bring the 4 columns I want regardless of their order over.
Set PPSExport = Range("A2:M2")
For Each cell In PPSExport
If cell.Value = "Asset" Then
cell.EntireColumn.Copy
ActiveSheet.Paste Destination:=Worksheets("Output Sheet").Range("A:A")
End If
If cell.Value = "Quantity" Then
cell.EntireColumn.Copy
ActiveSheet.Paste Destination:=Worksheets("Output Sheet").Range("B:B")
End If
If cell.Value = "Market value" Then
cell.EntireColumn.Copy
ActiveSheet.Paste Destination:=Worksheets("Output Sheet").Range("C:C")
End If
If cell.Value = "Portfolio weight %" Then
cell.EntireColumn.Copy
ActiveSheet.Paste Destination:=Worksheets("Output Sheet").Range("D:D")
End If
Next cell
Sheets("Output Sheet").Select
End With
Thanks for any help,
I've learnt so much already -- any pointers would be greatly appreciated :D
Yes, all of these things are possible. I will give a brief description on how to accomplish all of these things, but I recommend that you try to research how to do each of these tasks on your own before asking another question(s). It's also a good idea to keep the scope of your question limited. For example, you are asking about 3 loosely related items here. Yes, they are related to one another via your project, but in the general world of VBA programming, they are not. This will keep the conversation focused and easier to follow.
Find a better way to copy the columns over to the new sheet.
You made a great observation: your data is imported with headers. And your proposal is possible. You can certainly use the headers of a range (or table) to copy data. You could:
Iterate through all cells in the header row
If you come across one you are interested in, copy all of the data in that column to the new sheet
If you come across a column header you are not interested in, just skip it and move to the next one
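The steps above could be sketched roughly as follows. The sheet names and the four header texts are taken from the question; the layout of the output (one column per header, in the order listed) is an assumption made for illustration.

```vba
Sub CopyColumnsByHeader()
    Dim src As Worksheet, dst As Worksheet
    Dim wanted As Variant, found As Variant
    Dim i As Long, lastRow As Long, outCol As Long

    Set src = ThisWorkbook.Worksheets("Client Paste")
    Set dst = ThisWorkbook.Worksheets("Output Sheet")

    ' Header texts expected somewhere on row 2 of the export.
    wanted = Array("Listed", "Quantity", "MV", "PW")

    For i = LBound(wanted) To UBound(wanted)
        ' Match returns the column position of the header, or an error value.
        found = Application.Match(wanted(i), src.Rows(2), 0)
        If Not IsError(found) Then
            outCol = i - LBound(wanted) + 1
            lastRow = src.Cells(src.Rows.Count, found).End(xlUp).Row
            ' Copy from the header down to the last filled cell in that column.
            src.Range(src.Cells(2, found), src.Cells(lastRow, found)).Copy _
                Destination:=dst.Cells(1, outCol)
        End If
    Next i
End Sub
```

Because the columns are located by header text, an extra column in the export shifts nothing in the output.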
Find a way to remove certain "blacklist" products.
This is possible, and your proposed solution sounds ideal to me. Keep a record of all blacklist values in a sheet, and reference that list when necessary.
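A minimal sketch of that idea, assuming a hypothetical sheet named "Blacklist" with one code per cell in column A, and assuming the product code itself sits in column A of "Output Sheet" below a header row. It removes blacklisted rows after the copy rather than skipping them during it, which gives the same end result.

```vba
Sub RemoveBlacklistedRows()
    Dim outWs As Worksheet
    Dim blRange As Range
    Dim r As Long, lastRow As Long

    Set outWs = ThisWorkbook.Worksheets("Output Sheet")

    ' The blacklist can grow over time; the range is re-read on every run.
    With ThisWorkbook.Worksheets("Blacklist")
        Set blRange = .Range("A1", .Cells(.Rows.Count, "A").End(xlUp))
    End With

    lastRow = outWs.Cells(outWs.Rows.Count, "A").End(xlUp).Row

    ' Loop from the bottom up so deleting a row does not shift rows not yet checked.
    For r = lastRow To 2 Step -1
        If Not IsError(Application.Match(outWs.Cells(r, "A").Value, blRange, 0)) Then
            outWs.Rows(r).Delete
        End If
    Next r
End Sub
```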
Some product codes have 5 characters instead of 3, I'd like these ones to be copied in the same list but added to a separate list
Certainly possible. Once you have your data:
Iterate through all of it and check how many characters are in the value
If there are 5, copy that data to a new location or store it somewhere
If there are not 5, move on to the next value
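Those steps might look like the sketch below. It assumes the codes are in column A of "Output Sheet" from row 2 down and that column F is free to hold the separate list; both are placeholders to adjust.

```vba
Sub ListFiveCharacterCodes()
    Dim ws As Worksheet
    Dim r As Long, lastRow As Long, nextFree As Long
    Dim cellValue As Variant

    Set ws = ThisWorkbook.Worksheets("Output Sheet")
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    nextFree = 2                              ' first row of the separate list

    For r = 2 To lastRow
        cellValue = ws.Cells(r, "A").Value
        ' Skip error cells, then test the length of the trimmed text.
        If Not IsError(cellValue) Then
            If Len(Trim$(CStr(cellValue))) = 5 Then
                ' The original row stays in place; the code is also listed in column F.
                ws.Cells(nextFree, "F").Value = cellValue
                nextFree = nextFree + 1
            End If
        End If
    Next r
End Sub
```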

Vlookup based on criteria

The data I have in my "entity sheet"
entity id   source id   source entity id
HR0001      GOP         1200
HR0002      WSS         WSS1201
HR0003      GOP         1201
HR0004      WSS-T       WSST1202
HR0005      GOP         1202
HR0006      GOP         1203
HR0007      WSS-S       WSSS1203
HR0008      GOP         1204
HR0009      GOP         1205
HR0010      GOP         1206
HR0011      WSS-R       WSSR1204
HR0012      WSS-T       WSST1205
HR0013      WSS-S       WSSS1206
HR0014      GOP         1207
HR0015      WSS-T       WSSS1207
HR0006      WSS-S       WSSS1208
HR0007      GOP         1208
HR0008      WSS-R       WSST1209
HR0009      WSS-S       WSSS1210
In my working sheet, I need the source entity id (column C) data, by doing a VLOOKUP on the entity id (column A), based on the source id (column B). That is, I need only those IDs beginning with "WS" on my working sheet. My code is
Sub Test()
Worksheets("working sheet").Activate
Dim sht, sht1 As Worksheet
Dim i As Long, LR As Long
Set sht = ActiveWorkbook.Worksheets("working sheet")
Set sht1 = ActiveWorkbook.Worksheets("entity sheet")
LR = sht.UsedRange.Rows.Count
With sht
For i = 2 To LR
If InStr(sht1.Range("B" & i).Value, "WS") Then
sht.Range("B" & i).Value = (Application.VLookup(.Range("A" & i).Value, Worksheets("entity sheet").Range("A2:C5000"), 3, False))
End If
Next i
End With
End Sub
desired result - in the working sheet
entity id   source entity id - WSS
HR0001
HR0002      WSS1201
HR0003
HR0004      WSST1202
HR0005
HR0006      WSSS1208
HR0007      WSSS1203
HR0008      WSST1209
HR0009      WSSS1210
HR0010
HR0011      WSSR1204
HR0012      WSST1205
HR0013      WSSS1206
HR0014
HR0015      WSSS1207
Took me a little while but... I've got two different versions for you: one with VBA and one with just formulas.
With VBA
The issue you had was that VLOOKUP returns the first match but you needed to satisfy two criteria (that is: (i) match on entity id and (ii) match on source id begins with "WS").
This meant that you either had to:
use a formula that could match both criteria at the same time, OR
find all matches with the first criteria (e.g. with FIND) and then loop through the results to match the second criteria -- probably something like this: https://www.thespreadsheetguru.com/the-code-vault/2014/4/21/find-all-instances-with-vba
I selected option #1 as I expected it would make the code shorter.
To do this, I took advantage of a trick I've used in formulas before where I can use "&" between two ranges to match on two criteria at the same time. So, instead of matching "HR0012" first and then "WS-something" second, I match "HR0012WS-something" at once. (You can view this concept by pasting =A2:A20&B2:B20 in an empty column somewhere in your entity sheet.)
The following code assumes that your active worksheet is your working sheet. Paste this code behind your working sheet, then run it when you have that sheet open.
Public Sub tester()
Dim rg As Range
Dim sSourceEntityId As String
For Each rg In Range("A2:A16")
sSourceEntityId = Evaluate("=XLOOKUP(""" & rg.Value & "WS"",entity!A2:A20&LEFT(entity!B2:B20,2),entity!C2:C20,"""",0,1)")
If Len(sSourceEntityId) > 0 Then
rg.Offset(0, 1).Value = sSourceEntityId
End If
Next rg
End Sub
If the part inside the Evaluate is not clear, paste
=XLOOKUP(A1&"WS",entity!A2:A20&LEFT(entity!B2:B20,2),entity!C2:C20,"",0,1)
somewhere inside your working sheet to see it more clearly.
Also, note that you used Instr, which would find "WS" anywhere in the string. I used LEFT(value, 2)="WS" to be sure that I matched only the first 2 characters.
I also had to use XLOOKUP instead of VLOOKUP to allow me to use the LEFT(value, 2). If you're using an old version of Excel, you won't have XLOOKUP, unfortunately.
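For versions of Excel without XLOOKUP, the same two-criteria match can be sketched as a plain nested loop (essentially option #2 above, written out by hand). The sheet names "working sheet" and "entity sheet" come from the question; the sketch assumes entity ids in column A, source ids in column B, source entity ids in column C, and no error values in those cells.

```vba
Public Sub LookupWsSourceEntityIds()
    Dim wsWork As Worksheet, wsEntity As Worksheet
    Dim lastWork As Long, lastEntity As Long
    Dim i As Long, j As Long

    Set wsWork = ThisWorkbook.Worksheets("working sheet")
    Set wsEntity = ThisWorkbook.Worksheets("entity sheet")

    lastWork = wsWork.Cells(wsWork.Rows.Count, "A").End(xlUp).Row
    lastEntity = wsEntity.Cells(wsEntity.Rows.Count, "A").End(xlUp).Row

    For i = 2 To lastWork
        For j = 2 To lastEntity
            ' Criterion 1: same entity id. Criterion 2: source id starts with "WS".
            If CStr(wsEntity.Cells(j, "A").Value) = CStr(wsWork.Cells(i, "A").Value) _
               And Left$(CStr(wsEntity.Cells(j, "B").Value), 2) = "WS" Then
                wsWork.Cells(i, "B").Value = wsEntity.Cells(j, "C").Value
                Exit For                      ' first match wins, like a lookup
            End If
        Next j
    Next i
End Sub
```

The nested loop is slower than a single formula on large sheets, but it needs no newer worksheet functions and the two criteria are easy to read and change.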
Without VBA
Paste this formula into B2 on your working sheet (the formula refers to A2, so it cannot live in A2 itself):
=IFERROR(INDEX(entity!$C$2:$C$20,AGGREGATE(15,3,((entity!$A$2:$A$20&LEFT(entity!$B$2:$B$20,2)=A2&"WS")/(entity!$A$2:$A$20&LEFT(entity!$B$2:$B$20,2)=A2&"WS"))*ROW(entity!$A$2:$A$20)-ROW(entity!$A$1),1)),"")
Then copy that formula down to every row that you want to do a match on.
This is a little complex. I based it on an approach in this article, which explains step-by-step how to use INDEX(.. AGGREGATE(..)) for multiple matches. Although it's pretty neat how it works, you may prefer the VBA approach as it is probably easier to maintain.
UPDATE:
I forgot to mention that the IFERROR() in the formula may slow your spreadsheet down if you have many matches and rows. I also created a version of the formula that doesn't use IFERROR: it uses an IF to check whether there are any TRUE matches before executing the INDEX(.. AGGREGATE(..)). You may not need it, but I've pasted it below in case it's useful:
=IF(MAX(INT(entity!$A$2:$A$20&LEFT(entity!$B$2:$B$20,2)=A2&"WS"))=1,INDEX(entity!$C$2:$C$20,AGGREGATE(15,3,((entity!$A$2:$A$20&LEFT(entity!$B$2:$B$20,2)=A2&"WS")/(entity!$A$2:$A$20&LEFT(entity!$B$2:$B$20,2)=A2&"WS"))*ROW(entity!$A$2:$A$20)-ROW(entity!$A$1),1)),"")
UPDATE 2:
The statement used in the VBA Evaluate will also work directly as a formula and is much simpler to understand. I noticed this once I realized that a single valid match is okay (i.e. we don't need multiple matches):
=XLOOKUP(A2&"WS",entity!$A$2:$A$20&LEFT(entity!$B$2:$B$20,2),entity!$C$2:$C$20,"",0,1)

Copy & Pasting values from one Table to another using VBA and ListObjects

I am trying to compare spending data from two sources: a curated manual input from users and an automated extract, for different business units. The common data from both sources is the ID of the spending.
The idea is to aggregate both data sources (excel Tables) into one Table where the first two columns are the ID of the spending, the next column is the spending data from users related to that ID and the last one is the spending data from automated extract.
In this table, I'll have "double" the total spending for each ID, but then I can do a pivot table where I'll clearly compare the users input with the automated extract for each ID.
I highlighted the important fields I need to copy and paste.
My code is the following
Sub PGIvsManualInput()
With Application
.ScreenUpdating = False
.EnableEvents = False
End With
Set PGIvsManualTable = Worksheets("PGI vs Dépenses (Auto)").ListObjects("PGIvsManualInputAuto")
Set PGITable = Worksheets("PGI Clean").ListObjects("PGIExtract")
Set ManualInputTable = Worksheets("Dépenses").ListObjects("Dépenses")
'Cleaning the table
With Worksheets("PGI vs Dépenses (Auto)").Range("PGIvsManualInputAuto")
.ClearContents
.Borders(xlInsideHorizontal).LineStyle = xlNone
End With
With PGIvsManualTable
If .ListRows.Count >= 1 Then
.DataBodyRange.Rows.Delete
End If
End With
'Copy the data
PGITable.ListColumns(1).DataBodyRange.Resize(, 2).Copy Destination:= _
PGIvsManualTable
And that's where it gets messy. I can't even get the first batch of data to import properly! I am trying to copy the first 2 columns from PGITable and paste them into the first 2 columns of PGIvsManualTable. This worked previously without defining any destination column in my first example, even though the input and destination Tables didn't have the same number of columns.
But in this case, it extends the pasting to all columns of my destination table! I don't understand this behavior, as it doesn't happen in my previous example with basically the exact same code!
I tried to set the destination as follows but always got errors:
PGIvsManualTable.ListColumns(1).DataBodyRange.Resize(, 2) ==> Error 91
PGIvsManualTable.DataBodyRange(1,1) ==> Error 438
PGIvsManualTable.ListColumns(1).Resize(, 2) ==> Error 438
And a few others, but it never worked properly.
I expect the output to be my selected columns copied and pasted into my destination columns, based on the coordinates I provide via ListObject.DataBodyRange.
I guess that if I manage to make this first import work, all the others will work on the same template. In the meantime, my code seems to work on the previous example.
Deleting DataBodyRange.Rows will cause an issue if you then try to paste into the DataBodyRange: once the table has no rows, DataBodyRange is Nothing, which is why the first attempt raised Error 91.
As a workaround, you could delete all rows after the first, something like this example:
Sub Test()
Dim firstTbl As ListObject, secondTbl As ListObject
Set firstTbl = Sheet1.ListObjects("Table1")
Set secondTbl = Sheet1.ListObjects("Table2")
With secondTbl
.DataBodyRange.Clear
If .ListRows.Count > 1 Then
.DataBodyRange.Offset(1).Resize(.ListRows.Count - 1).Rows.Delete
End If
End With
firstTbl.ListColumns(1).DataBodyRange.Resize(, 2).Copy secondTbl.DataBodyRange(1, 1)
End Sub

How can I add a 1 to the most recent, repeated row in Excel?

I have a dataset with 60+ thousand rows in excel and about 20 columns. The "ID column" sometimes repeats itself and I want to add a column that will return 1 only in the row that is the most recent only IF it repeats itself.
Here is the example. I have…
ID DATE ColumnX
AS1 Jan-2013 DATA
AS2 Feb-2013 DATA
AS3 Jan-2013 DATA
AS4 Dec-2013 DATA
AS2 Dec-2013 DATA
I want…
ID DATE ColumnX New Column
AS1 Jan-2013 DATA 1
AS2 Feb-2013 DATA 0
AS3 Jan-2013 DATA 1
AS4 Dec-2013 DATA 1
AS2 Dec-2013 DATA 1
I've been trying with a combination of sort and nested if's, but it depends on my data being always in the same order (so that it looks up the ID in the previous row).
Bonus points: consider that my dataset is fairly large for Excel, so the most efficient code that won't eat up the processor would be appreciated!
An approach you could use is to point MSQuery at your table and use SQL to apply the business rules. On the positive side, this runs very quickly (a couple seconds in my tests against 64k rows). A huge minus is the query engine does not seem to support Excel tables exceeding 64k rows, but there might be ways to work around this. Regardless, I offer the solution in case it gives you some ideas.
To set up first give your data set a named range. I called it MYTABLE. Save. Next select a cell to the right of your table in row 1, and click through Data | From other sources | from Microsoft Query. Choose Excel Files* | OK, browse for your file. The Query Wiz should open, showing MYTABLE available, add all the columns. Click Cancel (really), and click Yes, you want to continue editing.
The MSQuery interface should open, click the SQL button and replace the code with the following. You will need to edit some specifics, such as the file path. (Also, note I used different column names. This was sheer paranoia on my part. The Jet engine is very finicky and I wanted to rule out conflicts with reserved words as I built this.)
SELECT
MYTABLE.ID_X,
MYTABLE.DATE_X,
MYTABLE.COLUMN_X,
IIF(MAXDATES.ID_x IS NULL,0,1) * IIF(DUPTABLE.ID_X IS NULL,0,1) AS NEW_DATA
FROM ((`C:\Users\andy3h\Desktop\SOTEST1.xlsx`.MYTABLE MYTABLE
LEFT OUTER JOIN (
SELECT MYTABLE1.ID_X, MAX(MYTABLE1.DATE_X) AS MAXDATE
FROM `C:\Users\andy3h\Desktop\SOTEST1.xlsx`.MYTABLE MYTABLE1
GROUP BY MYTABLE1.ID_X
) AS MAXDATES
ON MYTABLE.ID_X = MAXDATES.ID_X
AND MYTABLE.DATE_X = MAXDATES.MAXDATE)
LEFT OUTER JOIN (
SELECT MYTABLE2.ID_X
FROM `C:\Users\andy3h\Desktop\SOTEST1.xlsx`.MYTABLE MYTABLE2
GROUP BY MYTABLE2.ID_X
HAVING COUNT(1) > 1
) AS DUPTABLE
ON MYTABLE.ID_X = DUPTABLE.ID_X)
With the code in place MSQuery will complain the query can't be represented graphically. It's OK. The query will execute -- it might take longer than expected to run at this stage. I'm not sure why, but it should run much faster on subsequent refreshes. Once results return, File | Return data to Excel. Accept the defaults on the Import Data dialog.
That's the technique. To refresh the query against new data simply Data | Refresh. If you need to tweak the query you can get back to it through Excel via Data | Connections | Properties | Definition tab.
The code I provided returns your original data plus the NEW_DATA column, which has value 1 if the ID is duplicated and the date is the maximum date for that ID, otherwise 0. This code will not sort out ties if an ID's maximum date is on several rows. All such rows will be tagged 1.
Edit: The code is easily modified to ignore the duplication logic and show most recent row for all IDs. Simply change the last bit of the SELECT clause to read
IIF(MAXDATES.ID_x IS NULL,0,1) AS NEW_DATA
In that case, you could also remove the final LEFT JOIN with alias DUPTABLE.
Sort by ID, then by DATE (ascending). Define entries in new column to be 1 if previous row has the same ID and next row has a different ID or is empty (for last row), 0 otherwise.
It could be done in VBA. I'd be interested to know if this is possible just using formulas, I had to do something similar once before.
Sub Macro1()
Dim rowCount As Long
Sheets("Sheet1").Activate
rowCount = Cells(Rows.Count, 1).End(xlUp).Row
Columns("A:D").Select
Selection.AutoFilter
Range("D2:D" & rowCount).Select
Selection.ClearContents
Columns("A:D").Select
With ActiveWorkbook.Worksheets("Sheet1").AutoFilter.Sort
.SortFields.Clear 'drop any sort keys left over from an earlier run
'The first key added is the primary key: sort by ID, then by DATE
.SortFields.Add Key:=Range("A1:A" & rowCount), SortOn:=xlSortOnValues
.SortFields.Add Key:=Range("B1:B" & rowCount), SortOn:=xlSortOnValues
.Apply
End With
Dim counter As Long 'Long, not Integer: 60,000+ rows would overflow an Integer
For counter = 2 To rowCount
Cells(counter, 4) = 1
'Not the most recent row if the next row has the same ID
If Cells(counter, 1) = Cells(counter + 1, 1) Then Cells(counter, 4) = 0
Next counter
End Sub
So you activate the sheet and get the count of rows.
Then select and autofilter the results, and clear out Column D which has the 0s or 1s. Then sort on the values mbroshi suggested, which you say you're already using. Then execute a loop for each record, changing the value to 1, but then back to 0 if the row after it has the same ID.
Depending on your processor I don't think this would take more than a minute or two to run. If you do find something using formulas I would be interested to see it!
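If sorting the sheet is undesirable, a different technique is to read the data into an array and track the latest date per ID in a dictionary, which avoids touching cells one at a time. This is only a sketch under several assumptions: IDs in column A and real date values (not text) in column B of "Sheet1", a header in row 1, results written to column D, and the Windows-only Scripting.Dictionary object. It follows the example table in the question, so an ID that appears only once is also flagged 1, and ties on the latest date are all flagged 1.

```vba
Sub FlagMostRecentPerId()
    Dim ws As Worksheet
    Dim data As Variant, flags() As Long
    Dim latest As Object
    Dim i As Long, n As Long, lastRow As Long
    Dim key As String

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    If lastRow < 2 Then Exit Sub

    data = ws.Range("A2:B" & lastRow).Value   ' column 1 = ID, column 2 = DATE
    n = UBound(data, 1)
    ReDim flags(1 To n, 1 To 1)

    Set latest = CreateObject("Scripting.Dictionary")

    ' Pass 1: remember the most recent date seen for each ID.
    For i = 1 To n
        key = CStr(data(i, 1))
        If Not latest.Exists(key) Then
            latest(key) = data(i, 2)
        ElseIf data(i, 2) > latest(key) Then
            latest(key) = data(i, 2)
        End If
    Next i

    ' Pass 2: flag the rows whose date equals the most recent date for their ID.
    For i = 1 To n
        key = CStr(data(i, 1))
        If data(i, 2) = latest(key) Then flags(i, 1) = 1 Else flags(i, 1) = 0
    Next i

    ' Write all flags back in a single operation.
    ws.Range("D2").Resize(n, 1).Value = flags
End Sub
```

Reading and writing the ranges in bulk is usually much faster than looping over cells, which matters with 60,000+ rows.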
