Comparing multiple columns in Excel and remove dups

I have 3 columns of email addresses in Excel 2010, and I need to narrow all 3 columns down to unique values only. I don't necessarily need to merge the remaining values into a single column, but I definitely need to eliminate duplicates. I found another post that included a VBA macro, but it didn't seem to work; it removed only a few duplicates:
Sub removeDuplicates()
Dim lastCol As Integer
lastCol = 5 'col 5 is column E
Dim wks As Worksheet
Set wks = Worksheets("Sheet1")
Dim searchRange As Range
Set searchRange = wks.Range("A1:A" & wks.Cells(Rows.Count, "A").End(xlUp).Row)
Dim compareArray As Variant
Dim searchArray As Variant
'Get all values from Col A to search against
compareArray = searchRange.Value
For col = lastCol - 1 To 1 Step -1
'Set values to search for matches
searchArray = searchRange.Offset(0, col - 1).Value
'Set values to last column to compare against
compareArray = searchRange.Offset(0, col).Value
For i = 1 To UBound(compareArray)
If compareArray(i, 1) = searchArray(i, 1) Then
'Match found, delete and shift left
Cells(i, col).Delete Shift:=xlToLeft
End If
Next i
Next col
End Sub
Thanks!

Here is how I would propose doing this if it is a one-off task that you don't have to do very often.
Rather than typing out the entire process in detail, I have done a screencast of how I did this (and the entire process barely took me a minute to do).
The quick overview:
You will need to add a few temporary helper columns for unique values from each email list (one for each list), a 'merged list' column and then a final column. Filter for the unique emails using the 'Advanced' filter option one column at a time. Paste those values into the temporary column for that email list and then clear the filter. Repeat until you have gone through each column and each temporary column has the unique values in it from each list. Once you have the uniques from each list, paste these one at a time into the 'merged list' column (stacking the results in one long list) and then do a unique filter on that. Copy/paste the uniques from that list into your final column, clear the filter, and you're done.
Screencast is below:
http://screencast.com/t/zL8VmUut
Cheers!
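If this ever becomes a recurring task, the filter-and-stack steps above could also be scripted. What follows is only a minimal sketch, not a fix of the macro in the question. It assumes the three lists sit in columns A:C of a sheet named Sheet1, that column E is free for the output, and that email addresses should be compared case-insensitively. UniqueValues and ListUniqueEmails are hypothetical names, and Scripting.Dictionary is only available on Windows.

```vba
Option Explicit

' Returns the distinct non-empty values found in rng, in first-seen order.
' Comparison ignores case (an assumption that suits email addresses).
Function UniqueValues(ByVal rng As Range) As Variant
    Dim dict As Object
    Dim cell As Range
    Dim key As String

    Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = vbTextCompare   ' must be set before any item is added

    For Each cell In rng.Cells
        If Not IsError(cell.Value) Then
            key = Trim$(CStr(cell.Value))
            If Len(key) > 0 Then
                If Not dict.Exists(key) Then dict.Add key, True
            End If
        End If
    Next cell

    UniqueValues = dict.Keys   ' zero-based array of the unique values
End Function

' Writes the unique values from columns A:C of Sheet1 into column E.
Sub ListUniqueEmails()
    Dim ws As Worksheet
    Dim src As Range
    Dim items As Variant
    Dim i As Long

    Set ws = Worksheets("Sheet1")                        ' assumed sheet name
    Set src = Intersect(ws.Range("A:C"), ws.UsedRange)   ' skip unused cells
    If src Is Nothing Then Exit Sub

    items = UniqueValues(src)
    For i = LBound(items) To UBound(items)
        ws.Cells(i + 1, "E").Value = items(i)            ' assumed free column
    Next i
End Sub
```

This merges the three lists into one, which the question says is acceptable but not required. The source columns are left untouched.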

Since the first column holds the addresses you have already contacted, swap the first column with the second, then use the third column to flag whether each email in column A was found in column B (the ones you already contacted).
Formula:
=IF(ISERROR(VLOOKUP(A2,$B$2:$B$11,1,FALSE)),"Not Contacted","Yes")
Any row with a Yes status is on the contacted list. Filter on Not Contacted and you will have a new pending list in column A.
Simple.

Related

Excel remove duplicates based on 2 columns case-sensitive

I need to remove duplicates from an Excel worksheet based on the values in 2 columns while taking case into account.
In the example below, Rows 1 and 2 are duplicates (Row 2 should be removed). Row 3, 4, and 5 are unique.
Row   Column A   Column B
1     Abc        Def
2     Abc        Def
3     ABC        DEF
4     ABC        DeF
5     Abc        DeF
I've done this with other datasets using Data > Remove duplicates, but since it is case-insensitive, it won't work for this.
I also found this question, which is very similar, but only identifies duplicates based on 1 column.
(How to remove duplicates that are case SENSITIVE in Excel (for 100k records or more)?)

Try this code:
Sub SubRemoveDuplicates()
    'Declarations.
    Dim RngData As Range
    Dim RngDataToBeCompared As Range
    Dim RngCell As Range
    Dim RngRow As Range
    'Settings.
    Set RngData = Range("A1:C6")
    Set RngDataToBeCompared = Range("B2:C6")
    'Covering each row of the data to be compared.
    'Note: each row is compared only with the row directly under it.
    For Each RngRow In RngDataToBeCompared.Rows
CP_Rerun_For:
        'Covering each cell of the given row.
        For Each RngCell In RngRow.Cells
            'Checking if any cell is different from the one under it.
            If RngCell.Value <> RngCell.Offset(1, 0).Value Then
                'If such a cell has been found, skip to the next row.
                GoTo CP_Next_Row
            End If
        Next
        'Checking if the range to be targeted is within RngData.
        If Not Intersect(RngRow.Offset(1, 0).EntireRow, RngData) Is Nothing Then
            'Deleting the row of duplicates.
            Intersect(RngRow.Offset(1, 0).EntireRow, RngData).Delete (xlShiftUp)
            'Rerunning this cycle for the given row in order to catch duplicates that come in runs of more than 2.
            GoTo CP_Rerun_For
        End If
CP_Next_Row:
    Next
End Sub
Note: if you cover an entire column that presumably has many empty cells, the macro will process (and eventually delete) all those empty cells too. The macro can be modified to stop when it encounters an empty row, or to determine the appropriate range dynamically; otherwise it might take more time than necessary.
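The macro above compares each row only with the row directly beneath it. As an alternative sketch (not the answerer's code), a Scripting.Dictionary in binary compare mode can remember every pair of values already seen, so case-sensitive duplicates are caught even when they are not on adjacent rows. RemoveCaseSensitiveDuplicates is a hypothetical name. The sketch assumes Windows (for Scripting.Dictionary), assumes the key cells contain no error values, and deletes entire worksheet rows rather than only the cells inside the data range.

```vba
Option Explicit

' Deletes every row of dataRng whose values in the two key columns
' repeat an earlier row, matching case exactly.
' keyCol1 and keyCol2 are column positions relative to dataRng.
Sub RemoveCaseSensitiveDuplicates(ByVal dataRng As Range, _
                                  ByVal keyCol1 As Long, _
                                  ByVal keyCol2 As Long)
    Dim seen As Object
    Dim toDelete As Range
    Dim r As Long
    Dim key As String

    Set seen = CreateObject("Scripting.Dictionary")
    seen.CompareMode = vbBinaryCompare   ' "Abc" and "ABC" are different keys

    For r = 1 To dataRng.Rows.Count
        ' Join the two values with a separator that should not occur in the data.
        key = CStr(dataRng.Cells(r, keyCol1).Value) & vbNullChar & _
              CStr(dataRng.Cells(r, keyCol2).Value)
        If seen.Exists(key) Then
            ' Collect duplicate rows and delete them together at the end.
            If toDelete Is Nothing Then
                Set toDelete = dataRng.Rows(r).EntireRow
            Else
                Set toDelete = Union(toDelete, dataRng.Rows(r).EntireRow)
            End If
        Else
            seen.Add key, True
        End If
    Next r

    If Not toDelete Is Nothing Then toDelete.Delete
End Sub
```

With the layout from the question (values in B2:C6), it could be called as RemoveCaseSensitiveDuplicates Range("B2:C6"), 1, 2. Collecting the rows first and deleting once avoids the index shifting that comes with deleting inside the loop.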

I prefer not to use macros unless they are the last resort.
For your situation, I would suggest adding a new column with a formula such as
=LOWER(A2), filled down, to get the values of Column A in lower case. Then I would add one more new column and do the same for Column B.
After that, I would use Data > Remove Duplicates (converting the range to Table format first), and then delete the helper columns that were added for the lowercase conversion.

Use this formula to delete manually. It combines the two columns on one row and compares the result with the row above.
=B2&C2=B1&C1
You can then edit or filter on Col D and delete.

Sorting places my data with empty cells above it

I have written a bunch of VBA macros to get my data formatted how I need it, and the last step is to sort by this new column I have generated in ascending order. However, when I hit sort by the new column, the code now places all the empty cells above my newly generated column as I think it is reading the empty as a 0 and sorts it above any alphanumeric data. This is happening because of the UDF I have for sorting the data. I need to insert the new column with the UDF for each new cell that I insert, but I don't know how to define the range in the new column.
I am close to solving this but would love some help.
Essentially what I have tried for placing the data in a new column works, but the way I have set the range is placing it in a bad spot and it can easily be sorted in the wrong order now. I include all of my code, but the issue is in the last portion of it where I am setting a range to place the new data.
I think what is happening is when I set my range from C3-C2000 and populate it, the remaining empty cells are now included in my sort and give me "lower" numbers when I sort it ascending. Thus all the empty cells are ranked higher up in the column.
Option Explicit
Sub ContractilityData()
Dim varMyItem As Variant
Dim lngMyOffset As Long, _
lngStartRow As Long, _
lngEndRow As Long
Dim strMyCol As String
Dim rngCell As Range
Columns("B:B").Insert Shift:=xlToRight, CopyOrigin:=xlFormatFromLeftOrAbove 'make new column for the data to go
lngStartRow = 3 'Starting row number for the data. Change to suit
strMyCol = "A" 'Column containing the data. Change to suit.
Application.ScreenUpdating = False
For Each rngCell In Range(strMyCol & lngStartRow & ":" & strMyCol & Cells(Rows.Count, strMyCol).End(xlUp).Row)
lngMyOffset = 0
For Each varMyItem In Split(rngCell.Value, "_") 'put delimiter you want in ""
If lngMyOffset = 2 Then 'Picks which chunk you want printed out (each chunk is set by a _ currently)
rngCell.Offset(0, 1).Value = varMyItem
End If
lngMyOffset = lngMyOffset + 1
Next varMyItem
Next rngCell
Application.ScreenUpdating = True
'Here is where my problem arises
Range("C:C").EntireColumn.Insert
Dim sel As Range
Set sel = Range("C3:C2000")
sel.Formula = "=PadNums(B3,3)"
MsgBox "Data Cleaned"
End Sub
What I would like instead is a way to insert a new column, then have my UDF "PadNums" populate each cell up to the last cell of the previous column, essentially re-naming all my data from the previous column. I can then sort by the new column in ascending order and my data is in the correct order.
I think perhaps what I should do is copy column B into my newly inserted column C, then use some sort of last row function to apply the formula in all cells. That would give me the appropriate range always based on my original column?

I solved this! What I did was use range and xlDown to last row on column B, then pasted it to C, then inserted my UDF into C using the xlDown range!
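For anyone hitting the same problem, the fix described above might look roughly like the sketch below. It is a reconstruction, not the poster's actual code: it finds the last used row of column B with End(xlUp) (rather than xlDown, so a gap in the data does not cut the range short) and writes the formula only down to that row, so no empty cells end up in the sort. FillPadNums is a hypothetical name; PadNums is the poster's own UDF and must already exist in the workbook.

```vba
Option Explicit

' Fills column C with the PadNums formula from row 3 down to the last
' used row of column B, instead of a fixed C3:C2000 range.
Sub FillPadNums(ByVal ws As Worksheet)
    Const FIRST_ROW As Long = 3   ' first data row, as in the question
    Dim lastRow As Long

    ' Last non-empty cell in column B, searching up from the bottom of the sheet.
    lastRow = ws.Cells(ws.Rows.Count, "B").End(xlUp).Row
    If lastRow < FIRST_ROW Then Exit Sub   ' nothing to fill

    ' The relative reference B3 adjusts automatically for each row in the range.
    ws.Range("C" & FIRST_ROW & ":C" & lastRow).Formula = _
        "=PadNums(B" & FIRST_ROW & ",3)"
End Sub
```

In the macro from the question, a call such as FillPadNums ActiveSheet would stand in for the lines that set sel to C3:C2000 and assign its formula.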

Highlighting differences between duplicates in VBA

Hi, I have a spreadsheet with the following columns:
Transaction_ID counter State File_Date Date_of_Service Claim_Status NDC_9 Drug_Name Manufacturer Quantity Original_Patient_Pay_Amount Patient_Out_of_Pocket eVoucher_Amount WAC_per_Unit__most_recent_ RelayHealth_Admin_Fee Total_Voucher_Charge Raw_File_Name
There are duplicate transaction ID's here. Is there VBA that would highlight where there are differences between two rows? So there may be data with the same Transaction ID but I want to highlight where they may have other fields that are different, therefore they aren't truly duplicates and would like to see what information is different.
thanks!

Excel's find-duplicates conditional format should suffice for this. The problem is that it only works well on one column.
re: "So there may be data with the same Transaction ID but I want to highlight where they may have other fields that are different, therefore they aren't truly duplicates"
So instead of tracking duplicates in the Transaction ID column alone, you can try adding a new column and, in that new column, concatenating all the columns whose combined values should be unique, and then running Excel's find-duplicates conditional format on that column.
For example, if the combination of [Transaction_ID], [File_Date] and [NDC_9] should be unique, make a new column that combines the [Transaction_ID], [File_Date] and [NDC_9] column values. Assuming your data is in an actual table, you could have a table formula like so:
=[@Transaction_ID]&[@File_Date]&[@NDC_9]
re: "and would like to see what information is different."
You can then filter the dupes in that column and, looking at the other columns, see how they differ. It's not really possible to be any more specific than that with the way you've worded your question...

Assuming:
It's an unsorted dataset
column 1 contains the repeatable ID
the first row contains headers
...the following code (in the sheet's module) will turn yellow any cell whose value is totally unique for the ID that appears in the leftmost column...
Option Explicit
Public Sub HighlightUniqueValues()
Dim r As Long, c As Long 'row and column counters
Dim LastCol As Long, LastRow As Long 'right-most and bottom-most column and row
Dim ColLetter As String
Dim RepeatValues As Long
'get right-most used column
LastCol = Me.Cells(1, Me.Columns.Count).End(xlToLeft).Column
'get bottom-most used row
LastRow = Me.Cells(Me.Rows.Count, "A").End(xlUp).Row
'assume first column has the main ID
For r = 2 To LastRow 'skip the top row, which presumably holds the column headers
For c = 2 To LastCol 'skip the left-most column, which should contain the ID
'Get column letter
ColLetter = Split(Cells(1, c).Address(True, False), "$")(0)
' Count the number of repeat values in the current
'column associated with the same value in the
'left-most column
RepeatValues = WorksheetFunction.CountIfs(Range("A:A"), Range("A" & r), Range(ColLetter & ":" & ColLetter), Range(ColLetter & r))
' If there is only one instance, then it's a lone
'value (unique for that ID) and should be highlighted
If RepeatValues = 1 Then
Range(ColLetter & r).Interior.ColorIndex = 6 'yellow background
Else
Range(ColLetter & r).Interior.ColorIndex = 0 'white background
End If
Next c
Next r
End Sub

Remove columns based on their name (first value)

I'm using macros to quickly search a large table of student data and consolidate it into a single cell for use in seating plans (I'm a teacher). Most of it works but I have a problem with selecting just the data I need.
Steps:
1. Remove data.
2. Run a formula to check if students fit into particular groups and consolidate their information
3. Format
Different subjects and year groups have different layouts for their data and so this step is causing me problems. I've tried using absolute cell references in step 2 but this doesn't work as sometimes the information that should be in column D is in column E etc.
What I want to be able to do is have a macro that checks the first value in the column (i.e. the title) and, if it doesn't match one of a predetermined list, deletes the whole column along with its data.
Dim rng As Range
For Each rng In Range("everything")
    If rng.Value = "Test" Or rng.Value = "Test1" Then
        rng.EntireColumn.Hidden = True
    End If
Next rng
I think I could use something like this if I could change the output from hiding columns to deleting them?

re: What I want to be able to do is have a macro that checks the first value in the column (i.e. the title) and, if it doesn't match one of a predetermined list, deletes the whole column along with its data.
To delete all columns NOT WITHIN the list:
Sub del_cols()
    Dim c As Long, vCOL_LBLs As Variant
    vCOL_LBLs = Array("BCD", "CDE", "DEF")
    With Worksheets("Sheet7") '<~~ set this worksheet reference properly!
        'loop from right to left so a deletion does not shift columns still to be checked
        For c = .Cells(1, .Columns.Count).End(xlToLeft).Column To 1 Step -1
            If IsError(Application.Match(.Cells(1, c).Value, vCOL_LBLs, 0)) Then
                .Columns(c).Delete
            End If
        Next c
    End With
End Sub
To delete all columns WITHIN the list:
Sub del_cols_in_list() 'renamed so both subs can live in the same module
    Dim v As Long, vCOL_LBLs As Variant
    vCOL_LBLs = Array("BCD", "CDE", "DEF")
    With Worksheets("Sheet7") '<~~ set this worksheet reference properly!
        For v = LBound(vCOL_LBLs) To UBound(vCOL_LBLs)
            'keep deleting while the label is still found in the header row
            Do While Not IsError(Application.Match(vCOL_LBLs(v), .Rows(1), 0))
                .Cells(1, Application.Match(vCOL_LBLs(v), .Rows(1), 0)).EntireColumn.Delete
            Loop
        Next v
    End With
End Sub

Speeding up a search and delete macro

I have a list containing three columns. The first column contains names and the other two columns have numbers. The macro takes the first name (A1) and then searches down column A for another occurrence.
When it finds one, it deletes the entire row. It then goes to A2 and does the same thing again. It works OK for about 500 entries, but using 3000 entries slows it down considerably. Is there a way to speed up this code?
Sub Button1_DeleteRow()
Dim i As Integer
Dim j As Integer
Dim Value As Variant
Dim toCompare As Variant
For i = 1 To 3000
Value = Cells(i, 1)
For j = (i + 1) To 3000
toCompare = Cells(j, 1)
If (StrComp(Value, toCompare, vbTextCompare) = 0) Then
Rows(j).EntireRow.Delete
End If
Next j
Next i
End Sub

If you are running Excel 2007/2010, you can do this in a single line with Remove Duplicates. If you are running Excel 2003, a solution with AutoFilter will be the most efficient (I can provide this if you are on the older version).
Remove Duplicates
Manually:
1. Select column A
2. Data > Remove Duplicates
3. Expand selection
4. Select only column A to find duplicates on
Code:
ActiveSheet.Range("$A$1:$A$3000").EntireRow.RemoveDuplicates Columns:=1, Header:=xlNo

To supplement @brettdj's answer, if you are running Excel 2003, you can do this using AdvancedFilter as follows:
Range("A1:A11").AdvancedFilter Action:=xlFilterInPlace, Unique:=True
Note: AdvancedFilter assumes that the first row of your range (row 1 in this example) contains column headers and will not include that row in the filtering.
To do this manually: Data > Filter > Advanced Filter... > Unique records only

Using Brett's technique is a good answer, but to answer your question about why it takes so long:
- Your macro reads values from the worksheet over 4 million times, one cell at a time. This is very slow.
- I don't see that your macro has switched off screen updating and automatic calculation: every time a row is deleted the screen will refresh and Excel will recalculate. If you have not switched these off, it is very slow.
This code should run a lot faster:
Option Explicit
Sub Button1_DeleteRow()
Dim i As Long
Dim j As Long
Dim vArr As Variant
Dim iComp As Long
Dim Deletes(1 To 3000) As Boolean
Application.ScreenUpdating = False
iComp = Application.Calculation
Application.Calculation = xlCalculationManual
vArr = Range("a1:A3000")
For i = 1 To 3000
For j = (i + 1) To 3000
If (StrComp(vArr(i, 1), vArr(j, 1), vbTextCompare) = 0) Then
Deletes(j) = True
End If
Next j
Next i
For j = 3000 To 1 Step -1
If Deletes(j) Then Rows(j).EntireRow.Delete
Next j
Application.ScreenUpdating = True
Application.Calculation = iComp
End Sub

Sorting the data on column A would then make it trivial to identify and remove the duplicates in a single pass.
In response to the comment below, I'll explain why sorting is a useful technique.
By sorting column A into order, duplicate removal simply becomes a matter of comparing adjacent entries in column A. You can then either delete the duplicate rows as you find them or flag them for later deletion.
The process should actually be a lot less tedious, as you only have to sort the list (and sorting, being built-in, tends to be very fast) and then do one pass through the list (instead of 4,498,500 comparisons), deleting or flagging as you go (obviously you need a subsequent clean-up pass if you go for flagging).
On the issue of changing the order of the list, start by adding an extra column (e.g. column D) and have D2 contain the value 2 (i.e. just the row number). A quick fill-down later and every row is numbered. After sorting and deleting/flagging, restoring the original order is just a matter of re-sorting on column D, which can then be deleted.
I use this method when I have to perform some operation or other on the duplicates. In other words, column A has duplicate values but the values in columns B and C are meaningful (for example, I might want to sum these values from all of the entries relating to the specific value of column A). In many cases, however, it would be easier just to use SQL to achieve the same result.
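The sort-then-compare-neighbours idea can be sketched in VBA roughly as follows. This is only an illustration under stated assumptions, not tested against the asker's workbook: there is no header row, the data sits in columns A:C, column A holds text with no blank cells, and column D is free to use as the helper column. RemoveDuplicatesBySorting is a hypothetical name.

```vba
Option Explicit

' Removes rows whose column A value repeats an earlier row (ignoring case)
' by sorting, comparing adjacent rows once, then restoring the original order.
Sub RemoveDuplicatesBySorting(ByVal ws As Worksheet)
    Dim lastRow As Long
    Dim r As Long

    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    If lastRow < 2 Then Exit Sub

    ' Number every row in helper column D so the order can be restored later.
    For r = 1 To lastRow
        ws.Cells(r, "D").Value = r
    Next r

    ' Sort by column A, then by original row number, so the first
    ' occurrence of each name sits at the top of its group.
    ws.Range("A1:D" & lastRow).Sort _
        Key1:=ws.Range("A1"), Order1:=xlAscending, _
        Key2:=ws.Range("D1"), Order2:=xlAscending, _
        Header:=xlNo, MatchCase:=False

    ' Single pass from the bottom up: delete a row that matches the row above it.
    For r = lastRow To 2 Step -1
        If StrComp(CStr(ws.Cells(r, "A").Value), _
                   CStr(ws.Cells(r - 1, "A").Value), vbTextCompare) = 0 Then
            ws.Rows(r).Delete
        End If
    Next r

    ' Re-sort on the helper column to restore the original order, then clear it.
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    If lastRow > 1 Then
        ws.Range("A1:D" & lastRow).Sort _
            Key1:=ws.Range("D1"), Order1:=xlAscending, Header:=xlNo
    End If
    ws.Range("D1:D" & lastRow).ClearContents
End Sub
```

It could be run as RemoveDuplicatesBySorting ActiveSheet. Deleting rows one at a time is still slow on large sheets, so switching off screen updating and automatic calculation around the call, as in the answer above, would likely help.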
