Highlighting duplicate rows - excel

I would like to highlight duplicate rows in Excel VBA. Assume I have the following exemplary table with columns A, B, C and D for testing:
A B C D (Strings)
1 1 1 dsf
2 3 5 dgdgdgdg
1 1 1 dsf
2 2 2 xxx
6 3 4 adsdadad
2 2 2 xxx
The duplicate rows should be highlighted in any colour, e.g. grey. I am looking ideally for fast performing code, as it will be used for rather big tables.
Note there are solutions available for highlighting duplicate cells (but not duplicate rows). I don't know how to identify if rows are duplicates and at the same time how to do that fast, i.e. without nested looping. The solution should be in VBA (not Excel).
What is the best/fastest way to achieve that?

add a conditional formatting with the following sumproduct formula (or a countifs)
=SUMPRODUCT(($A$1:$A$6&$B$1:$B$6&$C$1:$C$6=$A1&$B1&$C1)*1)>1
Explanation:
SUMPRODUCT is handy to work with ranges which you need to manipulate prior to checking a condition. In this case I concatenate A, B & C columns across the range and compare it with the concatenation of the current row. I then convert the TRUE/FALSE array to a 1/0 array by multiplying by 1 and the SUM part of SUMPRODUCT sums the rows where the condition is true, giving me the duplicate rows (all occurences). If you have a small range, using the formula evaluation you can clearly see how this works.
It's a quick fix, but performance is not ideal, I use it a lot for detecting duplicates or generating sequential numbers.
Solution from comments suggested by ponydeer - higher performance
based on sorting suggesting, requires to add key column, put in auto filters and sort on key, then do conditional on key column:

I have tested 3 different approaches on the sample file link from OP's comment. Probably the VBA implementations were not optimal, but below are the results with average time of 100 passes:
1) Conditional formatting using:
a)SUMPRODUCT concatenating columns - 3s
b) COUNTIFS with full column reference - 1.9s
c) COUNTIFS referencing used ranges - 0.2s
2) Sorting the range on all columns, comparing row by row, sorting back - 0.3s
3) Using advanced filter 3.5s
Here is the code for the fastest method:
Sub CF1()
Application.ScreenUpdating = False
Dim sFormula As String
Dim rRng As Range
Dim nCol As Integer, i As Integer
Set rRng = Range("A1").CurrentRegion
nCol = rRng.Columns.Count
'build the formula
sFormula = "=COUNTIFS("
For i = 1 To nCol
sFormula = sFormula & rRng.Columns(i).Address & "," & _
rRng.Cells(1, i).Address(False, True)
If i < nCol Then sFormula = sFormula & ","
Next
sFormula = sFormula & ")>1"
'write the formula in helper cell to get it's local version
rRng.Cells(1, nCol + 1).Formula = sFormula
rRng.FormatConditions.Delete
With rRng.FormatConditions.Add(Type:=xlExpression, _
Formula1:=rRng.Cells(1, nCol + 1).FormulaLocal)
.Interior.ThemeColor = xlThemeColorAccent3
End With
rRng.Cells(1, nCol + 1).Clear
Application.ScreenUpdating = True
End Sub

Sort your range first regarding all columns
Workbooks(1).Sheets(1).Range("A:C").Sort Key1:=Workbooks(1).Sheets(1).Range("A:A"), Order1:=xlAscending, Key2:=Workbooks(1).Sheets(1).Range("B:B"), Order2:=xlAscending, Key3:=Workbooks(1).Sheets(1).Range("C:C"), Order3:=xlAscending, Orientation:=xlSortRows
Then loop through all rows and compare them with the one above them
Dim a As Application
Set a = Application
For i=1 to 1000 ' here you need to set the number of rows you have
if Join(a.Transpose(a.Transpose(ActiveSheet.Rows(i).Value)), Chr(0)) = _
Join(a.Transpose(a.Transpose(Sheets(1).Rows(i+1).Value)), Chr(0)) then
Sheets(1).Range(i+1 & ":" & i+1).EntireRow.Interior.Color = 49407
end if
Next i
The comparison of two rows is based on this thread: How to compare two entire rows in a sheet
Please insert the names of your Workbook, Sheet and set your range and the limits in the code yourself.

I think fastest/best will depend upon the proportion of duplicates – only one row should be quicker than 50% as in the example – and on the actual size of the array (how many columns from which to create a key, etc).
Given that it is rarely possible to beat inbuilt functions with ‘pure’ VBA I suspect using the UI, within VBA if desired, will be faster in some circumstances. Eg:
Add an index column (series fill would serve), copy entire sheet (say to Sheet2), apply Remove Duplicates to all but index column, then apply as CF formula rule of this kind the relevant range of the original sheet:
=$A1=MATCH($A1,Sheet2!$A$1:$A$3000,0)>0
Assuming the start point is like so:
and a ColumnA inserted with numeric series fill starting 1, Sheet2 should look so after Remove Duplicates:
I have assumed ColumnE is to be ignored as far as duplication is concerned.
In source sheet, select array (from A1: - see!), eg A1:I6 and HOME >Styles - Conditional Formatting, New Rule..., Use a formula to determine which cells to format, Format values where this formula is true::
=$A1=MATCH($A1,Sheet2!$A:$A,0)>0
Format..., Fill, grey, OK, OK.
For me results in:

Sub HighlightDuplicateRows_2()
'This one is more modifiable and can handle multiple columns of data
'Just add another *WorksheetFunction.CountIf(Range("A2:A" & LastRow),Cells(r,1).Value) > 1* and change the column values
Dim LastRow As Long
LastRow = Cells(Rows.Count, "A").End(xlUp).Row
For r = 2 To LastRow
If WorksheetFunction.CountIf(Range("A2:A" & LastRow), Cells(r, 1).Value) > 1 And WorksheetFunction.CountIf(Range("B2:B" & LastRow), Cells(r, 2).Value) > 1 Then
Rows(r).Interior.ColorIndex = 6
End If
Next r
End Sub

Related

How to copy the number if contains certain number (first 4 digit) to another column - EXCEL VBA

I'm trying to search on the specific column(E), and if matched with the first 4 digit, I would like to copy the number to a different column.
Column E is where i would like to paste all the random number(dynamic)
Column A/B/C is static where i would add 4 digits from time to time.
Column I/J/K is where is would like to paste the result.
PS:
I'm doing it manually and would really appreciate if someone can help me out with the automation hence no code is provided. :(
Having ExcelO365 means you may use FILTER(). Therefor try the below:
Formula in I2:
=FILTER($E:$E,ISNUMBER(MATCH(--LEFT($E:$E,4),A:A,0)))
Drag right to K2. Now, this is dynamic and will change accordingly upon data entry in column E:E, or changing values in A:C.
this is the code to execute on sheet 1, it goes through the entire column E and validates through the formula of counting if in each of the first three columns and assigns the value found in the corresponding columns.
Sub macro()
Dim Static_Data As String
Dim Sht As Worksheet
Set Sht = ThisWorkbook.Sheets("Hoja1")
Active_row = 2
Do While Sht.Range("E" & Active_row).Value <> ""
Static_Data = Sht.Range("E" & Active_row).Value
For i = 1 To 3
If Application.WorksheetFunction.CountIf(Sht.Columns(i), Mid(Static_Data, 1, 4)) > 0 Then
Sht.Cells(Sht.Cells(Rows.Count, i + 8).End(xlUp).Row + 1, i + 8).Value = Static_Data
End If
Next i
Active_row = Active_row + 1
Loop
End Sub
For Excel versions that don't support FILTER or as an alternative you can use standard formulas for this.
If you use columns F-H as helper columns (and these columns can be hidden) then the formula in F2 will be:
=IF(NOT(ISERROR(VLOOKUP(VALUE(LEFT($E2,4)),A$2:A$100,1,FALSE)))=TRUE,$E2,"")
The formula can then be copied across and down. This will find your matches.
In order to then remove the blanks from the data you can use the following formula in I2 and again copy across and down. Depending on how many numbers you want to add in, you may want to extend the range A$2:A$100 in the top formula and F$2:F$100 in the bottom formula
=IFERROR(INDEX(F$2:F$100,AGGREGATE(15,6,(ROW(F$2:F$100)-ROW(F$2)+1)/(F$2:F$100<>""),ROWS(I$2:I2))),"")

Optimizing Excel CountIfs - can I make it go faster?

I have some larger files I need to validate the data in. I have most of it automated to input the formulas I need automatically. This helps eliminate errors of copy and paste on large files. The problem is with this latest validation.
One of the latest validations involves counting the number of rows that match 3 columns. The 3 columns are in Sheet 2 and the rows to count are in Sheet 1. Then compare this count with an expected number based on Sheet 2. It is easy enough to do with CountIFs, but there are large files and it can take up to an hour on some of them. I am trying to find something faster.
I am using a smaller file and it is still taking about 1 minute. There are only about 1800 rows.
I have something like this:
In Check1 I am using: =COUNTIFS(Sheet1!A:A,A2,Sheet1!B:B,B2,Sheet1!C:C,C2)
My code puts that formula in the active cell. Is there a better way to do this?
Is there anyway - using VB or anything - to improve the performance.
When the rows start getting into the 10's of thousands it is time to start this and get lunch. And, then hope it is done when I get back to my desk!
Thanks.
You basically have to iterate over all rows for each column, this is expensive. You might be able to split this into two tasks:
Merge your Columns A-C into one value =CONCAT(A2,B2,C2)
Then do only a single countif on this column =COUNTIF(D:D,D2)
That way you get rid of two (time) expensive countifs at the cost of the new concat.
You should narrow the range CountIf acts on from entire columns to the actual used range
And your code could write the result of the formula instead of the formula itself
Like follows:
With Sheet1
Set sheet1Rng = Intersect(.UsedRange, .Range("A:C"))
End With
With Sheet2
For Each cell in Intersect(.UsedRange, .Range("A:A"))
cell.Offset(,3) = WorksheetFunction.CountIfs(sheet1Rng.Columns(1), cell.Value, sheet1Rng.Columns(2), cell.Offset(,1).Value, sheet1Rng.Columns(3),cell.Offset(2).Value)
Next cell
End With
I set up a mock sheet, using a layout similar to what you show, with 10,000 rows, and manually filled it with the COUNTIFS formula you show. Changing a single item in the data triggered a recalculation which took about ten seconds or so to execute.
I then tried the following macro, which executed in well under one second. All of the counting is done within the VBA macro. So this Dictionary method may be an answer to your speed problems.
Before running this, you may want to set the Calculation state to Manual (or do it in the code) if you have COUNTIFS on the worksheet.
Option Explicit
'set reference to Microsoft Scripting Runtime
Sub CountCol123()
Dim DCT As Dictionary
Dim V As Variant
Dim WS As Worksheet, R As Range
Dim I As Long
Dim sKey As String
Set WS = Worksheets("sheet2")
'read the info into an array
With WS
Set R = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp)).Resize(columnsize:=4)
V = R
End With
'Get count of the matches
Set DCT = New Dictionary
For I = 2 To UBound(V, 1)
sKey = V(I, 1) & "|" & V(I, 2) & "|" & V(I, 3)
If DCT.Exists(sKey) Then
DCT(sKey) = DCT(sKey) + 1
Else
DCT.Add Key:=sKey, Item:=1
End If
Next I
'Get the results and write them out
For I = 2 To UBound(V, 1)
sKey = V(I, 1) & "|" & V(I, 2) & "|" & V(I, 3)
V(I, 4) = DCT(sKey)
Next I
'If you have COUNTIFS on the worksheet when testing this,
' or any other formulas that will be triggered,
' then uncomment the next line
'Application.Calculation = xlCalculationManual
With R
.EntireColumn.Clear
.Value = V
End With
End Sub
The Excel alternative named Cell in Hancom Office 2020 is insanely fast at countifs. Not sure why. On my i7-5775C, Excel 2019 takes 90 seconds for a countifs with two criteria for populating 10,000 rows with the results. Using Cell, the exact same operation completes in less than 28 seconds. I have verified that the results match those generated by Excel 2019.

Highlighting differences between duplicates in VBA

Hi I have a spreadsheet with the following columns :
Transaction_ID counter State File_Date Date_of_Service Claim_Status NDC_9 Drug_Name Manufacturer Quantity Original_Patient_Pay_Amount Patient_Out_of_Pocket eVoucher_Amount WAC_per_Unit__most_recent_ RelayHealth_Admin_Fee Total_Voucher_Charge Raw_File_Name
There are duplicate transaction ID's here. Is there VBA that would highlight where there are differences between two rows? So there may be data with the same Transaction ID but I want to highlight where they may have other fields that are different, therefore they aren't truly duplicates and would like to see what information is different.
thanks!
Excel's find duplicates conditional format should suffice for this. The problem is that it only works well off one column.
So there may be data with the same Transaction ID but I want to highlight where they may have other fields that are different, therefore they aren't truly duplicates
So instead of tracking duplicates in the Transaction ID column alone, you can try adding a new column and, in that new column, concatenate all the columns for which the combined values should be unique - and then run Excel's find duplicates conditional format on that column.
For example if the combination of [Transaction_ID], [File_Date] and [NDC_9] should be unique, make a new column that combines [Transaction_ID], [File_Date] and [NDC_9] column values - assuming your data is in an actual table you could have a table formula like so:
=[#Transaction_ID]&[#File_Date]&[#NDC_9]
and would like to see what information is different.
You can then filter the dupes in that column, and then, looking at the other columns you can see how they are different. It's not really possible to be any more specific than that with the way you've worded your question...
Assuming:
It's an unsorted dataset
column 1 contains the repeatable ID
the first row contains headers
...the following code (in the SHeet's module) will turn any cell yellow that has a value that is totally unique for the ID that appears in the leftmost column...
Option Explicit
Public Sub HighlightUniqueValues()
Dim r As Long, c As Long 'row and column counters
Dim LastCol As Long, LastRow As Long 'right-most and bottom-most column and row
Dim ColLetter As String
Dim RepeatValues As Long
'get right-most used column
LastCol = Me.Cells(1, Me.Columns.Count).End(xlToLeft).Column
'get bottom-most used row
LastRow = Me.Cells(Me.Rows.Count, "A").End(xlUp).Row
'assume first column has the main ID
For r = 2 To LastRow 'skip the top row, which presumably holds the column headers
For c = 2 To LastCol 'skip the left-most column, which should contain the ID
'Get column letter
ColLetter = Split(Cells(1, c).Address(True, False), "$")(0)
' Count the number of repeat values in the current
'column associated with the same value in the
'left-most column
RepeatValues = WorksheetFunction.CountIfs(Range("A:A"), Range("A" & r), Range(ColLetter & ":" & ColLetter), Range(ColLetter & r))
' If there is only one instance, then it's a lone
'value (unique for that ID) and should be highlighted
If RepeatValues = 1 Then
Range(ColLetter & r).Interior.ColorIndex = 6 'yellow background
Else
Range(ColLetter & r).Interior.ColorIndex = 0 'white background
End If
Next c
Next r
End Sub
e.g...

Combine Adjacent Row Cells Depending on their Left Column Cell

I'd like to combine cells in the right column into one cell according to the adjacent cell on the left. I tried Merging but I could only get so far. And after searching online I couldn't find anything that can parse each row and combine for the length of the left cell's span. I know it's a CONCATENATE function, but how would I get it to parse the whole spreadsheet?
This is an example of the results I would want for the above:
This may be too complicated - in which case I would go back to the drawing board and do a full VBA version, but initially I was looking for a challenge to construct a solution only using formulas. Unfortunately, there appears to be no standard formula-based approach to concatenate a variable number of cells.
So, to accomplish this, I added one function:
Function CombineRange(ByRef rng As Range, ByVal delim As String)
Dim arr
Dim i As Long
arr = rng.Value
CombineRange = ""
For i = 1 To UBound(arr)
If i > 1 Then
CombineRange = CombineRange & delim
End If
CombineRange = CombineRange & arr(i, 1)
Next i
End Function
Assumptions:
your data is in a sheet called "YourData"
Your merged data is column A
Your "single row" data is column B
Row 1 is some kind of header row.
Next, set up four columns on a new sheet (I call it "Collapsed")
A - Start Row = (first row) whatever row your data starts on (2, in our case)
A - Start Row = (all others) A2+B2
B - Offset = {IFERROR(MATCH(FALSE,ISBLANK(INDIRECT(ADDRESS(A2+1,1,,,"YourData")&":A200")),0),0)}
Note this is an array function, so you need to do shift+Enter when entering it
C - Level1 = =INDEX(YourData!A:A,A2)
D - Combined Level 2 = =IF(B2<=1, INDIRECT(ADDRESS(A2,2,,,"YourData")), CombineRange(INDIRECT(ADDRESS(A2,2,,,"YourData")&":"&ADDRESS(A2+B2-1,2)),"; "))

Copying and pasting nonzero rows and adjacent cells

I am dealing with a lot of GIS metadata that I am importing into Excel, with many rows and columns of blank or zero values. I am trying to take data like this:
(Left Column is the name, columns to the right are the values associated with that name)
and by only selecting the columns that have a value, end up with a new set of tables showing the nonzero rows and its corresponding name to the left:
I have tried doing it by filtering the data in a table so that it only shows nonzero values, copying that column and the far left column and pasting it onto a new sheet. This is easy if I only have a few columns, but given the amount I have it is very painstaking. I have to filter each column separately because the rows in each column may or may not have blanks or zeros depending on the column.
Can the LOOKUP function be used for this, or would using VBA be better?
why can't use 2 simple loops (rows, columns)??
i suppose that the data start from "A1", I count rows and columns and play this code.
Sub tras()
Dim lastRw As Integer
Dim lastCol As Integer
Dim ResultRow As Integer
ResultRow = 20 '1th row for result
lastRw = Range("A1").End(xlDown).Row
lastCol = Cells(1, ActiveSheet.Columns.Count).End(xlToLeft).Column
Set rgn = Range("A1", Cells(lastRw, lastRw))
For x = 2 To lastCol
For y = 1 To lastRw
If Cells(y, x) <> "" Then
Cells(ResultRow, 1) = Cells(y, 1).Value
Cells(ResultRow, 2) = Cells(y, x).Value
ResultRow = ResultRow + 1
End If
Next y
ResultRow = ResultRow + 1
Next x
End Sub
A filtering in a similar case can be performed with array formulas such as in the example below.
See an example:
http://i.stack.imgur.com/YlyUN.jpg
(I cannot post images yet - please edit my post so that the image is integrated)
To use it select range E1:E6, write the shown formula and press ctrl+shift+enter.
The logic is as follows:
1) based on range B1:B6 create two vectors: 1 with row numbers of each row which simply is a v1 = {1,2,3,4,5,6} in this case and a vector of TRUEs and FALSEs for each row with TRUE if a row is nonempty and FALSe if empty: in this case it is v2 = {TRUE,FALSE,TRUE,TRUE,FALSE,FALSE}.
2) multiply v1 and v2 element wise which gives us vector v3 = {1,0,3,4,0,0}
3) using SMALL function extract values from the smallest to the largest using COUNTBLANK function to skip all the zeros (their number is equal to number of blanks) - ROW function used on range E1:E6 serves as our iterator in SMALL function
4) after point 3) you end up with row numbers of non-empty cells based on range B1:B6, now you have to call INDEX function to extract values from range A1:A6
5) add IFERROR on top of everything so that it returns "" when going out of SMALL's input array's range
To assign values in column F simply use INDEX + MATCH (I encourage everyone to forget about *LOOKUP).
What you end up here with is something that I think addresses your core problem. If you want to make it all "draggable" or "fillable", use it on ranges other that beginning with 1 or put it all in setup row after row rather than column after column, you'll have to make some modifications to the formulas used here but the logic will be the same.

Resources