Removing duplicates from large sheet

I want to remove rows based on duplicate cells in a column of a large sheet, without leaving one sample of each duplicate behind (as the Excel "Remove Duplicates" command does). So if I have:
1
2
2
3
I want, as a result:
1
3
This can be accomplished with conditional formatting, then filtering or sorting the duplicates and deleting the filtered data, but the process is very slow for a large sheet.
Conditional formatting takes a second, but just clicking on the filter takes around 5 minutes to display the filter context menu and an additional 20-30 minutes to do the actual filtering by color. I tried this on different PCs with 4 cores and plenty of RAM, on a sheet with 100,000 rows.
I then thought of writing VBA that iterates over the column cells and deletes the entire row if the cell is colored (this is possible in Excel 2010 with Cells().DisplayFormat), but the processing takes even more time.
Can someone suggest a faster way to remove duplicates on a large sheet?

EDIT: Note that I have used two procedures. Of these, test is a Sub that checks whether GetUniqueItems works (you will have to modify it for your scenario).
Also, I filled cells A1 to A100000 with test values. Please modify the range to suit your needs.
Option Explicit

' Returns an n x 1 array of the values that occur exactly once in src
' (src is expected to be a single column). Returns Empty if no value is unique.
Function GetUniqueItems(ByVal src As Range) As Variant
    Dim returnValue
    Dim dictOfItemsWith1Value
    Dim dictOfItemsWithMoreThan1Value
    Dim countOfCells As Long
    Dim counter As Long
    Dim srcValues As Variant
    Dim currentValue
    Dim key

    srcValues = src.Value ' read the whole range into memory in one go
    countOfCells = src.Cells.Count
    Set dictOfItemsWith1Value = CreateObject("Scripting.Dictionary")
    Set dictOfItemsWithMoreThan1Value = CreateObject("Scripting.Dictionary")

    For counter = 1 To countOfCells
        currentValue = srcValues(counter, 1)
        If dictOfItemsWithMoreThan1Value.exists(currentValue) Then
            dictOfItemsWithMoreThan1Value(currentValue) = dictOfItemsWithMoreThan1Value(currentValue) + 1
        Else
            If Not dictOfItemsWith1Value.exists(currentValue) Then
                dictOfItemsWith1Value.Add currentValue, 1
            Else
                ' second occurrence: move the value to the "more than 1" dictionary
                dictOfItemsWith1Value.Remove currentValue
                dictOfItemsWithMoreThan1Value.Add currentValue, 1
            End If
        End If
    Next

    ' ReDim (1 To 0) would fail, so leave the result Empty when nothing is unique
    If dictOfItemsWith1Value.Count = 0 Then Exit Function

    ReDim returnValue(1 To dictOfItemsWith1Value.Count, 1 To 1)
    counter = 1
    For Each key In dictOfItemsWith1Value.keys
        returnValue(counter, 1) = key
        counter = counter + 1
    Next
    GetUniqueItems = returnValue
End Function

Sub test()
    Debug.Print Now
    Dim uniqueValues
    uniqueValues = GetUniqueItems(Range("A1:A100000"))
    Range("A1:A100000").ClearContents
    If IsArray(uniqueValues) Then
        Range("A1").Resize(UBound(uniqueValues, 1)) = uniqueValues
    End If
    Debug.Print Now
End Sub
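
For readers who want to check the logic outside Excel, the idea behind this answer is to count every value and keep only those seen exactly once. The following is a minimal sketch in Python rather than VBA, purely to illustrate the logic; the function name keep_singletons is hypothetical and not part of the answer.

```python
from collections import Counter


def keep_singletons(values):
    """Return the values that occur exactly once, in their original order."""
    counts = Counter(values)  # one pass to count every value
    return [v for v in values if counts[v] == 1]


if __name__ == "__main__":
    print(keep_singletons([1, 2, 2, 3]))  # [1, 3]
```

Both passes are linear in the number of cells, which is why a dictionary-based approach tends to scale far better than filtering by color.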

My way to deal with large excel files where I have to remove large chunks of data:
1. After the last column, add a helper column with COUNTIF() (much like KazJaw's and DanM's COUNTIF):
   =COUNTIF($A$1:$A$100000,A1)
   $A$1:$A$100000 contains your ids. Change accordingly.
2. Fill the formula down to the bottom (drag it, or select the range, e.g. $B$1:$B$100000 if B is the helper column, then press Ctrl+D).
3. Copy the column and paste values in place to remove the formulas. This prevents any recalculation during or after filtering.
4. Sort by the column with the counts. This makes deleting the large portion of rows much faster later on.
5. Find where the counts of 2 start and delete all rows from there to the bottom.
6. Delete the helper column.
If you want to restore the original order, add yet another column after the count column once step 3 is done (for example, filled with the original row numbers). After step 5, sort by this new column in ascending order, then delete it in step 6.
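
The recipe above (count, sort by count, cut off the tail, re-sort by original position) can be sketched outside Excel as follows. This is an illustrative Python sketch under the assumption that each row is a tuple and the id sits in the first position; the function name drop_duplicated_rows is hypothetical.

```python
def drop_duplicated_rows(rows, key=0):
    """Mimic the helper-column recipe: count, sort by count, cut, restore order."""
    counts = {}
    for row in rows:
        counts[row[key]] = counts.get(row[key], 0) + 1

    # Two helper "columns": the COUNTIF result and the original position.
    tagged = [(counts[row[key]], pos, row) for pos, row in enumerate(rows)]

    tagged.sort(key=lambda t: t[0])  # sort by the count column
    # Find the first row whose count is 2 or more and drop everything from there down.
    cut = next((i for i, t in enumerate(tagged) if t[0] >= 2), len(tagged))
    kept = tagged[:cut]

    kept.sort(key=lambda t: t[1])  # restore the original order
    return [row for _, _, row in kept]  # drop the helper columns
```

Sorting first means the rows to delete form one contiguous block, which is the reason the deletion step is fast in Excel as well.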

If your data is located in column A, this formula should do what you need fairly efficiently:
=COUNTIF(A$1:A$100000,A1)
This formula counts how many times the value in A1 appears in the range A1:A100000. (The dollar signs keep that range from moving down as you drag your formula down.)
Place this in B1 and drag down* to B100000 (assuming you have 100,000 rows).
Then just filter column B to show only 1. (A count above 1 means the value has duplicates, so those rows should not be shown.)
*A shortcut for dragging down is to select B1, scroll down to the last row without clicking, hold down Shift and click B100000, then press Ctrl-D (the shortcut for Fill Down).

Related

Excel VBA: How to swap two selected cell ranges (not only two values) within the same column?

I would like to swap selected cell ranges within the same column without having automatically adjusted attached formulas in other columns. Those cell ranges will almost always be of unequal size.
I found VBA code that does this for two selected cells, but I'm afraid it won't help me much.
Sub SwapCells()
Dim sHolder As String
If Selection.Cells.Count = 2 Then
With Selection
sHolder = .Cells(1).Formula
If .Areas.Count = 2 Then ' Cells selected using Ctrl key
.Areas(1).Formula = .Areas(2).Formula
.Areas(2).Formula = sHolder
Else ' Adjacent cells are selected
.Cells(1).Formula = .Cells(2).Formula
.Cells(2).Formula = sHolder
End If
End With
Else
MsgBox "Select only TWO cells to swap", vbCritical
End If
End Sub
I know that another option would be to hold Shift when moving the cell ranges (this works perfectly fine), but then all the attached formulas change their references, which I don't want (e.g. if I have a formula referring to cell A1 and I swap A1 somewhere else, the formula will refer to A1's new position, but I want the formula to still refer to A1).
I think another option would be to use INDIRECT("G" & ROW()) to fix it, but since it's quite a resource-intensive formula, I'd love to see an alternative.
On top of that, the latter two options would not allow me to use tables (which I'd prefer for other reasons) because you can't swap cells in tables. This is why I'd strongly prefer a VBA option.
I hope you can help me, thank you! Maybe it is only necessary to adjust the VBA code a little.
Kind regards,
Marco
EDIT: If it is significantly easier to swap two equal cell ranges (e.g. encompassing 5 cells each), then it would also be a good solution.

Sub SwapTwoSelectedRanges()
Dim initialRng As Range
Set initialRng = Selection
If initialRng.Areas.Count <> 2 Then
Debug.Print "Select 2 areas!"
Exit Sub
End If
If initialRng.Areas(1).Cells.Count <> initialRng.Areas(2).Cells.Count Then
Debug.Print "The cells should be the same number!"
Exit Sub
End If
Dim intermediateRng As Variant
intermediateRng = initialRng.Areas(1).Cells.Value2
initialRng.Areas(1).Cells.Value2 = initialRng.Areas(2).Cells.Value2
initialRng.Areas(2).Cells.Value2 = intermediateRng
End Sub
Swapping two values is an easy task if you use an intermediate value. With ranges, there are two important checks to perform before swapping them:
Are exactly 2 areas selected?
Is the number of cells equal in both areas?
Then the swap is made with intermediateRng as a third variable.
This only works if each area lies within a single column. If the selection is made per row, the results will not be as expected.
Concerning swapping the colors: if all the cells in each area have exactly the same color, this would work:
Dim intermediateRng As Variant
Dim intermediateClr As Variant
intermediateRng = initialRng.Areas(1).Cells.Value2
intermediateClr = initialRng.Areas(1).Cells.Interior.Color
With initialRng
.Areas(1).Cells.Value2 = .Areas(2).Cells.Value2
.Areas(1).Cells.Interior.Color = .Areas(2).Cells.Interior.Color
.Areas(2).Cells.Value2 = intermediateRng
.Areas(2).Cells.Interior.Color = intermediateClr
End With
However, if the colors of the cells per Area are not the same, then the easiest way is to copy+paste the first range to a separate range and work from there.
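
The same "two checks, then swap through an intermediate copy" pattern can be sketched on a plain list standing in for one column. This is an illustrative Python sketch, not the VBA from the answer; the function name swap_blocks and its parameters are hypothetical.

```python
def swap_blocks(column, start_a, start_b, length):
    """Swap two non-overlapping, equal-length blocks of a list in place."""
    if length < 0 or min(start_a, start_b) < 0:
        raise ValueError("negative start or length")
    if max(start_a, start_b) + length > len(column):
        raise ValueError("block runs past the end of the column")
    if start_a != start_b and abs(start_a - start_b) < length:
        raise ValueError("blocks overlap")

    held = column[start_a:start_a + length]  # intermediate copy of block A
    column[start_a:start_a + length] = column[start_b:start_b + length]
    column[start_b:start_b + length] = held
    return column
```

Only the values move. Anything that refers to a position by index (the equivalent of a formula pointing at A1) keeps pointing at the same position, which is the behavior the question asks for.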

Dynamic Summing Range

Currently I have a medical spread-sheet with a list of clients that we have serviced. We have 8 different clinical categories which are denoted by different acronyms - HV,SV,CV,WV,CC,OV,TS and GS.
A client can receive multiple therapies, e.g. HV,SV,CV. In the background we have a counter mechanism that increments each of these records by 1. The formula used for this counter is:
=(LEN('Parent Sheet'!F25)-LEN(SUBSTITUTE('Parent Sheet'!F25,'Parent Sheet'!$P$4,"")))/LEN('Parent Sheet'!$P$4)
At the bottom of the sheet we then have a sum which adds up all the treatments that occurred for that week.
Now the tricky part is that we have almost a year's worth of data in this sheet, and the summing formulas are set as SUM(COLUMN 6: COLUMN 53). Because we need to increase the entries beyond this limit, we have to adjust the sum formulas. We have 300 SUM formulas adding up each of the 8 criteria items and assigning them to the HV, SV, SC, WV etc. counters.
Would we have to adjust these manually one by one, or is there an easier way of doing this?
Thank you very much!

I think you should change the sheet layout a little, create a User Defined Function (UDF), and alter the formulas in your Sum rows so that rows and columns can be added efficiently (making use of Excel's formula fill). The only drawback is that you need to save the workbook as a macro-enabled file.
What you need to change in the formulas is to use $ to lock the column or row references that must not shift when the formula is filled.
To illustrate with an example:
Assume the data starts at row 6 and goes no further than row 15 (you can use the idea of another data gap on the top). Alter the Sum row titles to begin with the abbreviation, then create a UDF like the one below:
Option Explicit
' The oRngType refers to a cell where the abbreviation is stored
' The oRngCount refers to cells that the abbreviation is to be counted
' Say "HV" is stored in $C16, and the cells to count for HV is D$6:D$15,
' then the sum of HV for that date (D16) is calculated by formula
' `=CountType($C16, D$6:D$15)`
Function CountType(ByRef oRngType As Range, ByRef oRngCount) As Long
Dim oRngVal As Variant, oVal As Variant, oTmp As Variant, sLookFor As String, count As Long
sLookFor = Left(oRngType.Value, 2)
oRngVal = oRngCount.Value ' Load all the values onto memory
count = 0
For Each oVal In oRngVal
If Not IsEmpty(oVal) Then
For Each oTmp In Split(oVal, ",")
If InStr(1, oTmp, sLookFor, vbTextCompare) > 0 Then count = count + 1
Next
End If
Next
CountType = count
End Function
Formulas in the sheet:
Columns to sum are fixed to rows 6 to 15 and Type to lookup is fixed to Column C
D16 | =CountType($C16,D$6:D$15)
D17 | =CountType($C17,D$6:D$15)
...
E16 | =CountType($C16,E$6:E$15)
E17 | =CountType($C17,E$6:E$15)
The way I created the UDF is to lookup and count appearances of a cell value (first argument) within a range of cells (second argument). So you can use it to count a type of treatment for a big range of cells (column G).
Now if you add many columns after F, you just need to use the AutoFill and the appropriate rows and columns will be there.
You can also create another VBA Sub to add rows and columns and formulas for you, but that's a different question.
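
To make the counting rule of the UDF easier to verify, here is a minimal sketch of the same logic in Python (used only for illustration, not as a replacement for the VBA). It assumes, as the UDF does, that the first two characters of the label are the abbreviation and that each cell holds a comma-separated list; the name count_type is hypothetical.

```python
def count_type(label, cells):
    """Count comma-separated entries that contain the label's 2-letter code."""
    look_for = label[:2].lower()  # e.g. "HV" taken from "HV total"
    total = 0
    for cell in cells:
        if not cell:  # skip empty cells
            continue
        for token in str(cell).split(","):
            if look_for in token.lower():  # case-insensitive, like vbTextCompare
                total += 1
    return total
```

For example, count_type("HV total", ["HV,SV,CV", "hv", None, "SV, HV"]) counts three HV entries.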

It isn't a great idea to have 300 sum formulas.
Name your data range and use that name inside the SUM formula. Each time the named data range expands, the sum is calculated over the expanded range. Here's how to create a dynamic named range.
Sorry I just saw your comment. Following is a simple/crude VBA snippet.
Range("B3:F12") is valueRange; Range("C18") is rngTotal.
Option Explicit
Sub SumAll()
Dim WS As Worksheet
Dim rngSum As Range
Dim rngData As Range
Dim rowCount As Integer
Dim colCount As Integer
Dim i As Integer
Dim varSum As Variant
'assuming that your said mechanism increases the data range by 1 row
Set WS = ThisWorkbook.Sheets("Sheet2")
Set rngData = WS.Range("valueRange")
Set rngSum = WS.Range("rngTotal")
colCount = rngData.Columns.Count
'to take the newly added row (by your internal mechanism) into consideration
rowCount = rngData.Rows.Count + 1
ReDim varSum(0 To colCount - 1) 'one slot per data column
For i = 0 To UBound(varSum, 1)
varSum(i) = Application.Sum(rngData.Resize(rowCount, 1).Offset(, i))
Next i
'transpose variant array with totals to sheet range
rngSum.Resize(colCount, 1).Value = Application.Transpose(varSum)
'release objects in the memory
Set rngSum = Nothing
Set rngData = Nothing
Set WS = Nothing
Set varSum = Nothing
End Sub

You can use named ranges as suggested by bonCodigo or you could use find and replace or you can insert the columns within the data range and Excel will update the formula for you automatically.

Link selective rows from one sheet to the next

So I've searched everywhere... I have an office 2007 excel spreadsheet with two pages, one labeled "i" and the other "t." I need to display selective rows (those rows that have a value in column A--not all do) from "i" in "t." I also need the rows in the "t" page to be in numerical order. I figured out how to do it across all rows, but not how to selectively add rows with values only in column A. Further, when I add new rows to "i," "t" doesn't automatically update. Any advice on how to accomplish this would be of immense help!
I have access to office 2010. I don't know if that makes the coding easier?
Thank you!
Jason

You could try a user-defined function like this:
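' Returns the Index-th non-blank value in Selection (Empty if there are fewer).
' Long is used instead of Integer so that row numbers above 32767 do not overflow.
Function NonBlank(Selection As Range, Index As Long) As Variant
    Dim cell As Range
    Dim Count As Long
    Count = 0
    For Each cell In Selection
        If Len(cell.Value) > 0 Then
            Count = Count + 1
            If Index = Count Then
                NonBlank = cell.Value
                Exit For
            End If
        End If
    Next
End Function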
Then on Sheet t, you can put =NonBlank(i!A:A,ROW(A1)) in the first cell where you want the first non-blank value of Sheet i, and then copy the formulas down.
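
The selection rule behind NonBlank ("skip blanks, return the n-th value that remains") can be sketched as follows. This is an illustrative Python version of the logic only; the name nth_non_blank is hypothetical, and None stands in for the Empty value the VBA function returns when there are not enough non-blank cells.

```python
def nth_non_blank(cells, index):
    """Return the index-th (1-based) non-blank value, or None if there is none."""
    seen = 0
    for cell in cells:
        if cell is not None and str(cell) != "":
            seen += 1
            if seen == index:
                return cell
    return None
```

Note that each call rescans the column from the top, so filling the formula down many rows repeats that work for every row.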

Excel Lookup return multiple values horizontally while removing duplicates

I would like to do a vertical lookup for a list of lookup values and then have multiple values returned into columns for each lookup value. I actually managed to do this after a long Google search, this is the code:
=INDEX(Data!$H$3:$H$70000, SMALL(IF($B3=Data!$J$3:$J$70000, ROW(Data!$J$3:$J$70000)-MIN(ROW(Data!$J$3:$J$70000))+1, ""), COLUMN(A$2)))
Now, my problem is, as you can see in the formula, my lookup range contains 70,000 rows, which means a lot of return values. But most of these return values are duplicates. This means I have to drag the above formula across many columns until all lookup values (roughly 200) return #NUM!.
Is there any possible way, I guess VBA is necessary, to return the values after duplicates have been removed? I'm new to VBA and I am not sure how to go about this. Also, it takes forever to calculate with so many cells.

[Edited]
You can do what you want with a revised formula, though I am not sure how efficient it will be with 70,000 rows.
Use this formula for the first match:
=IFERROR(INDEX(Data!$H3:$H70000,MATCH($B3,Data!$J3:$J70000,0)),"")
Now, assuming that formula is in F5, use this formula in G5, confirmed with CTRL+SHIFT+ENTER and copied across:
=IFERROR(INDEX(Data!$H3:$H70000,MATCH(1,($B3=Data!$J3:$J70000)*ISNA(MATCH(Data!$H3:$H70000,$F5:F5,0)),0)),"")
Change the $F5:F5 part depending on the location of formula 1.
This will give you a list without repeats, and when you run out of values you get blanks rather than an error.

Not sure if you're still after a VBA answer but this should do the job - takes about 25 seconds to run on my machine - it could probably be accelerated by the guys on this forum:
Sub ReturnValues()
    Dim rnSearch As Range, rnLookup As Range, rnTemp As Range
    Dim varArray As Variant
    Dim lnIndex As Long
    Dim strTemp As String

    Set rnSearch = Sheet1.Range("A1:A200") 'Set this to your 200 row value range
    Set rnLookup = Sheet2.Range("A1:B70000") 'Set this to your lookup range (assume 2 columns)
    varArray = rnLookup
    For Each rnTemp In rnSearch
        For lnIndex = LBound(varArray, 1) To UBound(varArray, 1)
            strTemp = rnTemp.Value
            If varArray(lnIndex, 1) = strTemp Then
                If WorksheetFunction.CountIf(rnTemp.EntireRow, varArray(lnIndex, 2)) = 0 Then 'Check if value exists already
                    Sheet1.Cells(rnTemp.Row, rnTemp.EntireRow.Columns.Count).End(xlToLeft).Offset(0, 1).Value = varArray(lnIndex, 2)
                End If
            End If
        Next
    Next
End Sub
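
One possible way to speed this up is to scan the lookup data once and collect the distinct values per key, instead of rescanning all 70,000 rows for every lookup value. The sketch below shows that idea in Python for illustration only; it is a different technique from the nested loop above, and the name unique_matches is hypothetical.

```python
def unique_matches(pairs, lookups):
    """Map each lookup key to its distinct values, in first-seen order."""
    # A dict per key is used as an ordered set of the values seen so far.
    found = {key: {} for key in lookups}
    for key, value in pairs:  # single pass over the (key, value) data
        if key in found:
            found[key].setdefault(value, None)
    return {key: list(values) for key, values in found.items()}
```

Each resulting list corresponds to one output row: the values can be written across the columns next to the lookup value, and keys with no match simply get an empty list.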

Speeding up a search and delete macro

I have a list containing three columns. The first column contains Names and the other two columns have numbers. The macro takes the first name(A1) and then searches down column A for another occurrence.
When it finds it, it deletes the entire row. It then goes to A2 and does the same thing again. It works OK for about 500 entries, but with 3000 entries it slows down considerably. Is there a way to speed up this code?
Sub Button1_DeleteRow()
Dim i As Integer
Dim j As Integer
Dim Value As Variant
Dim toCompare As Variant
For i = 1 To 3000
Value = Cells(i, 1)
For j = (i + 1) To 3000
toCompare = Cells(j, 1)
If (StrComp(Value, toCompare, vbTextCompare) = 0) Then
Rows(j).EntireRow.Delete
End If
Next j
Next i
End Sub

If you are running xl07/10 then you can do this with a single line with Remove Duplicates. If you are running 03 then a solution with AutoFilter will be most efficient (I can provide this if you are on the older version)
Remove Duplicates
Manually:
1. Select column A
2. Data > Remove Duplicates
3. Expand selection
4. Select only column A to find duplicates on
Code:
ActiveSheet.Range("$A$1:$A$3000").EntireRow.RemoveDuplicates Columns:=1, Header:=xlNo

To supplement @brettdj's answer, if you are running Excel 2003, you can do this using AdvancedFilter as follows:
Range("A1:A11").AdvancedFilter Action:=xlFilterInPlace, Unique:=True
Note: AdvancedFilter assumes that the first row of your range (row 1 in this example) contains column headers and will not include that row in the filtering.
To do this manually: Data > Filter > Advanced Filter... > Unique records only

Using Brett's technique is a good answer, but to answer your question about why it takes so long:
- Your macro is getting a value from over 4 million cells one by one. This is very slow.
- I don't see that your macro has switched off screenupdating and automatic calculation: every time a row is deleted the screen will refresh and Excel will recalculate. If you have not switched these off it is very slow.
This code should run a lot faster
Option Explicit
Sub Button1_DeleteRow()
Dim i As Long
Dim j As Long
Dim vArr As Variant
Dim iComp As Long
Dim Deletes(1 To 3000) As Boolean
Application.ScreenUpdating = False
iComp = Application.Calculation
Application.Calculation = xlCalculationManual
vArr = Range("a1:A3000")
For i = 1 To 3000
For j = (i + 1) To 3000
If (StrComp(vArr(i, 1), vArr(j, 1), vbTextCompare) = 0) Then
Deletes(j) = True
End If
Next j
Next i
For j = 3000 To 1 Step -1
If Deletes(j) Then Rows(j).EntireRow.Delete
Next j
Application.ScreenUpdating = True
Application.Calculation = iComp
End Sub
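
The two ideas in this answer, flag first and delete afterwards from the bottom up, can be sketched as below. This is an illustrative Python sketch; note that the flagging step uses a set instead of the nested loop, which is a swapped-in technique and not what the VBA above does. The function names are hypothetical.

```python
def flag_duplicates(names):
    """Flag every name that already appeared earlier (case-insensitive)."""
    seen = set()
    flags = []
    for name in names:
        key = name.lower()
        flags.append(key in seen)  # True means "delete this row"
        seen.add(key)
    return flags


def delete_flagged(rows, flags):
    """Delete flagged rows from the bottom up so earlier indices stay valid."""
    for i in range(len(rows) - 1, -1, -1):
        if flags[i]:
            del rows[i]
    return rows
```

Deleting from the last row towards the first matters because removing a row shifts everything below it up by one; walking downwards would skip the row that moves into the deleted position.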

Sorting the data on column A would then make it trivial to identify and remove the duplicates in a single pass
In response to the comment below, I'll explain why sorting is a useful technique.
By sorting column A into order, duplicate removal simply becomes a matter of comparing adjacent entries in column A. You can then either delete the duplicate rows as you find them or flag them for later deletion.
The process should actually be a lot less tedious as you only have to sort the list (and sorting, being built-in, tends to be very fast) and then do one pass (instead of 4498500) through the list deleting/flagging as you go (obviously you need a subsequent clean-up pass if you go for flagging).
On the issue of changing the order of the list, start by adding an extra column (e.g. column D) and have D2 contain the value 2 (i.e. just the row number). A quick fill-down later and every row is numbered. After sorting and deleting/flagging, restoring the original order is just a matter of re-sorting on column D which could then be deleted.
I use this method when I have to perform some operation or other on the duplicates. In other words, column A has duplicate values but the values in columns B and C are meaningful (for example, I might want to sum these values from all of the entries relating to the specific value of column A). In many cases, however, it would be easier just to use SQL to achieve the same result
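
The sort-then-compare-neighbours approach described above can be sketched as follows. For 3000 rows the nested loop makes 3000 * 2999 / 2 = 4,498,500 comparisons, whereas after sorting a single pass over adjacent entries is enough. This is an illustrative Python sketch that keeps the first occurrence of each name and compares case-insensitively; the name dedupe_by_sorting is hypothetical.

```python
def dedupe_by_sorting(names):
    """Keep the first occurrence of each name, using sort + adjacent compare."""
    # Sort row numbers by (name, original position); the position plays the
    # role of the extra numbering column used to restore the order later.
    order = sorted(range(len(names)), key=lambda i: (names[i].lower(), i))

    keep = []
    previous = None
    for i in order:  # one pass comparing each entry with its neighbour
        key = names[i].lower()
        if key != previous:
            keep.append(i)
        previous = key

    keep.sort()  # re-sort on the original row numbers
    return [names[i] for i in keep]
```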
