I have an Excel file and I want to compare the dates in columns A and D and delete the gap between them.
For example:
The time in column A starts at 14:56:23 and in column D at 14:56:18, so I want to delete all the data in column D up to 14:56:23 so that both columns start at the same time.
This situation will come up many times, so I want to develop a macro to handle it.
Here is a small program that compares just the first two cells in columns A and D:
Sub Edit_Date_time()
    Dim r As Range
    Dim l As Range
    Set r = Range("A2")
    Set l = Range("D2")
    If r.Value <> l.Value Then
        Range("D2:E2").Select
        Selection.Delete Shift:=xlUp
    End If
End Sub
The problem is that the cells contain both date and time, so I cannot compare them as plain values. I also have to expand this code to cover the whole of columns A and D, not only the first two cells.
Your question has morphed many times, but I am going with the question "how to eliminate rows without matching timestamps" ...
Comparing timestamps can be tricky, even if they are correctly formatted. You would expect 2/17/2016 14:56:29 to be equal to 2/17/2016 14:56:29, but there may be a difference in milliseconds that you cannot see in the string or in the general format. Therefore, you should use a tolerance when determining <, >, or =.
Remember, with timestamps 1.0 = 1 day. So 1/10 of a second is (1/24/60/60/10).
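To make that concrete, here is a minimal sketch of a tolerance-based comparison (my own helper, not part of the macro below; the name TimesMatch and the 1/10-second tolerance are just the values discussed above):

Function TimesMatch(t1 As Double, t2 As Double) As Boolean
    'treat two datetime serials as equal when they differ by less than 1/10 of a second
    'the tolerance is expressed in days because 1.0 = one day in Excel's date system
    Const TOL As Double = 1# / 24# / 60# / 60# / 10#
    TimesMatch = Abs(t1 - t2) < TOL
End Function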
I assume you want to delete pressure rows when those timestamps are earlier than the corresponding temperature timestamps, AND you want to delete temperature rows when those timestamps are earlier than the corresponding pressure timestamps.
This means, worst case, a loop will need to go through the entire data set twice.
I tested code against this data ...
After processing, the yellow cells should align, the orange rows should be deleted. Here are the results I get ...
using this code ...
Sub ParseDateTime()
    Dim TRange As Range, PRange As Range
    Dim iLoop As Long, LoopEnd As Long
    Dim theRow As Long, LastRow As Long

    ' set the range for the temperature data
    LastRow = Sheets("Sheet1").Range("A" & Sheets("Sheet1").Rows.Count).End(xlUp).Row
    Set TRange = Sheets("Sheet1").Range("A1:C" & LastRow)
    LoopEnd = LastRow

    ' set the range for the pressure data
    LastRow = Sheets("Sheet1").Range("D" & Sheets("Sheet1").Rows.Count).End(xlUp).Row
    Set PRange = Sheets("Sheet1").Range("D1:E" & LastRow)
    If LastRow > LoopEnd Then LoopEnd = LastRow

    ' loop through the range
    theRow = 1
    For iLoop = 2 To 2 * LoopEnd
        theRow = theRow + 1
        ' stop searching when no more data
        If TRange(theRow, 1) = "" And PRange(theRow, 1) = "" Then Exit For
        ' if out of temperature data, eliminate the rest of the pressure data
        If TRange(theRow, 1) = "" Then
            PRange.Rows(theRow).Delete Shift:=xlUp
            theRow = theRow - 1
        End If
        ' if out of pressure data, eliminate the rest of the temperature data
        If PRange(theRow, 1) = "" Then
            TRange.Rows(theRow).Delete Shift:=xlUp
            theRow = theRow - 1
        End If
        ' eliminate pressure rows where the time stamp is earlier than the temperature timestamp
        If (TRange(theRow, 1).Value > PRange(theRow, 1).Value) And _
           (Abs(TRange(theRow, 1).Value - PRange(theRow, 1).Value) >= (1# / 24# / 60# / 60# / 10#)) Then
            PRange.Rows(theRow).Delete Shift:=xlUp
            theRow = theRow - 1
        Else
            ' eliminate temperature rows where the time stamp is earlier than the pressure timestamp
            If (TRange(theRow, 1).Value < PRange(theRow, 1).Value) And _
               (Abs(TRange(theRow, 1).Value - PRange(theRow, 1).Value) >= (1# / 24# / 60# / 60# / 10#)) Then
                TRange.Rows(theRow).Delete Shift:=xlUp
                theRow = theRow - 1
            End If
        End If
    Next iLoop
End Sub
The redefinitions of this question make it hard to deal with, particularly as some (now removed) criteria would render some solutions to the current problem impractical.
I remember when your data came from multiple CSV files; some containing temperatures and some containing pressures. In fact there was so much data that it could conceivably 'spill' over to another worksheet. This fact alone renders individual worksheet value comparisons impractical. Even if it all fit on a single worksheet, comparing a million datetimes with a second set of a million datetimes and removing entries that do not fit both categories is going to be an arduous and time-consuming task.
Arduous and time-consuming tasks are best processed 'in-memory'. Repeatedly returning to the worksheet(s) to compare values is going to bog down processing and should be avoided unless absolutely necessary.
This seems like it should be an SQL question where two different sets of CSVs are loaded into two temporary but consolidated database tables and indexed on their respective datetimes. An INNER JOIN could then be performed to build a third table of matching records. Easy-peasy.
But this is an Excel / VBA question and should be answered in kind.
A VBA Scripting.Dictionary object is like an in-memory database table and comes with a unique primary 'index' called the key. It also has a single additional 'field' of the variant type which can receive any style of value or values that a variant can. Loading two dictionaries with the respective values (one for temperatures and another for the pressures) using the datetimes as the keys would seem to be the most efficient method of combining the two.
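As a minimal sketch of that idea (the names and the sample key below are mine, not taken from the code that follows), keying a dictionary on the datetime serial makes the later match-up a simple Exists lookup:

Sub DictionaryKeySketch()
    'each reading is stored under its datetime serial; a duplicate key simply overwrites
    Dim dTemps As Object
    Set dTemps = CreateObject("Scripting.Dictionary")
    dTemps(42417.6224884) = Array(21.5, 22.1)            'example datetime key -> temperature values
    If dTemps.Exists(42417.6224884) Then
        Debug.Print "Temp 1:", dTemps(42417.6224884)(0)
    End If
End Sub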
Sample data
I started with several CSVs similar to the following.
Temperaturen-25.csv SPS-25.csv
Three temperature CSVs and three pressure CSVs totalled ~300K records (~150K each) with periods of intentionally missing datetimes from each.
Module2 (Code)
Option Explicit
'public constant dictating the maximum number of entries per worksheet (never set higher than Rows.Count-3)
Public Const iMAXROWS As Long = 50000
Sub main()
    Dim fp As String, fn As String, tmp As Variant
    Dim dt As Variant, tdic As Object, pdic As Object
    Dim tpwb As Workbook, a As Long, d As Long, w As Long

    'apptggl btggl:=false 'uncomment this when you have finished debugging

    'create 2 dictionary objects to receive ALL of the data
    Set tdic = CreateObject("Scripting.Dictionary")
    Set pdic = CreateObject("Scripting.Dictionary")
    tdic.CompareMode = vbBinaryCompare
    pdic.CompareMode = vbBinaryCompare

    'load the dictionaries using the overwrite method
    fp = Environ("TMP") & Chr(92) & "TempPress"
    fn = Dir(fp & Chr(92) & "*.csv", vbNormal)
    Do While CBool(Len(fn))
        Select Case True
            Case LCase(fn) Like "*temperaturen*"
                'debug.Print "found " & fn
                loadTPDictionary CStr(fp & Chr(92) & fn), tdic, 3
            Case LCase(fn) Like "*sps*"
                'debug.Print "found " & fn
                loadTPDictionary CStr(fp & Chr(92) & fn), pdic, 2
            Case Else
                'do nothing; not temperature or pressure
        End Select
        'debug.Print tdic.Count & ":" & pdic.Count
        fn = Dir
    Loop
    'debug.Print tdic.Count
    'debug.Print pdic.Count

    'At this point you have two dictionary objects; one for temps and one for pressures
    'They have a unique indexed key on their datetime values
    'Time to merge the two

    'First load all matching pressures into the temperatures
    For Each dt In tdic
        If pdic.Exists(dt) Then
            tdic.Item(dt) = Array(tdic.Item(dt)(0), tdic.Item(dt)(1), tdic.Item(dt)(2), _
                                  pdic.Item(dt)(1), pdic.Item(dt)(0))
        End If
    Next dt

    'Second, get rid of temps that had no matching pressure
    For Each dt In tdic
        If UBound(tdic.Item(dt)) < 4 Then
            tdic.Remove dt
        End If
    Next dt
    'debug.Print tdic.Count
    'debug.Print pdic.Count

    'At this point the temp dictionary object contains a merged set of matching temps and pressures
    'Time to put the values into one or more worksheets

    'create a new target workbook and set up the first target worksheet
    Set tpwb = Workbooks.Add
    With tpwb
        For w = 1 To Int(tdic.Count / iMAXROWS) + 1
            a = 1: d = 1
            'first load an array with the dictionary's values
            ReDim tmp(1 To iMAXROWS, 1 To 5)
            For Each dt In tdic
                If d > (w * iMAXROWS) Then
                    Exit For
                ElseIf d > ((w - 1) * iMAXROWS) Then
                    tmp(a, 1) = tdic.Item(dt)(0)
                    tmp(a, 2) = tdic.Item(dt)(1)
                    tmp(a, 3) = tdic.Item(dt)(2)
                    tmp(a, 4) = tdic.Item(dt)(3)
                    tmp(a, 5) = tdic.Item(dt)(4)
                    a = a + 1
                End If
                d = d + 1
            Next dt
            On Error GoTo bm_Need_Worksheet
            With .Worksheets(w + 1) '<~~ ignore the original blank worksheet from the new workbook
                'dump the values back into the worksheet
                .Cells(2, 1).Resize(UBound(tmp, 1), UBound(tmp, 2)) = tmp
                'format the datetimes
                .Range("A2:A" & UBound(tmp, 1) + 1 & ",E2:E" & UBound(tmp, 1) + 1).NumberFormat = _
                    "[Color10]mm/dd/yyyy hh:mm:ss"
                .Columns("A:E").AutoFit
            End With
            'clear the variant array
            Erase tmp
        Next w
        'get rid of the original unprepped worksheet created with the new workbook
        .Worksheets(1).Delete
        'save as a binary workbook due to size considerations
        .SaveAs Filename:=fp & Chr(92) & Format(Date, "\T\P\_yyyymmdd\_") & CLng(Timer), _
                FileFormat:=xlExcel12, AddToMru:=True
        'close savechanges:=false 'uncomment this after debugging
    End With

    'we got safely here; skip over worksheet creation
    GoTo bm_Safe_Exit

bm_Need_Worksheet:
    On Error GoTo 0
    With tpwb.Worksheets.Add(After:=Sheets(Sheets.Count))
        On Error GoTo bm_Need_Worksheet
        .Range("A1:E1") = Array("Date and Time", "Temperature 1", "Temperature 2", _
                                "Pressure", "Date and Time (p)")
        .Name = "Temperaturen & Pressure " & w
        With .Parent.Windows(1)
            .SplitColumn = 0: .SplitRow = 1
            .FreezePanes = True
            .Zoom = 75
        End With
    End With
    Resume

bm_Safe_Exit:
    'discard the dictionary objects
    tdic.RemoveAll: Set tdic = Nothing
    pdic.RemoveAll: Set pdic = Nothing
    'restore the application environment
    appTGGL
End Sub
Sub loadTPDictionary(fpn As String, ByRef dict As Object, flds As Long)
    Dim f As Long, v As Long, vVALs As Variant, wb As Workbook

    Workbooks.OpenText Filename:=fpn, StartRow:=1, DataType:=xlDelimited, _
                       ConsecutiveDelimiter:=False, _
                       Comma:=True, Tab:=False, Semicolon:=False, Space:=False, Other:=False, _
                       FieldInfo:=IIf(flds = 3, Array(Array(1, 3), Array(2, 1), Array(3, 1)), _
                                                Array(Array(1, 3), Array(2, 1)))
    With ActiveWorkbook
        With Worksheets(1)
            'Debug.Print .Cells(1, 1).Value
            vVALs = .Range(.Cells(2, 1), .Cells(Rows.Count, flds).End(xlUp)).Value2
        End With
        .Close SaveChanges:=False
    End With

    If flds = 3 Then
        For v = LBound(vVALs, 1) To UBound(vVALs, 1)
            'fastest load method but overwrites duplicate datetime values with the last temp1, temp2
            dict.Item(vVALs(v, 1)) = Array(vVALs(v, 1), vVALs(v, 2), vVALs(v, 3))
        Next v
    Else
        For v = LBound(vVALs, 1) To UBound(vVALs, 1)
            'fastest load method but overwrites duplicate datetime values with the last pressure
            dict.Item(vVALs(v, 1)) = Array(vVALs(v, 1), vVALs(v, 2))
        Next v
    End If
    Erase vVALs
End Sub
Public Sub appTGGL(Optional bTGGL As Boolean = True)
    With Application
        .ScreenUpdating = bTGGL
        .EnableEvents = bTGGL
        .DisplayAlerts = bTGGL
        .AutoRecover.Enabled = bTGGL 'no interruptions with an auto-save
        .Calculation = IIf(bTGGL, xlCalculationAutomatic, xlCalculationManual)
        .CutCopyMode = False
        .StatusBar = vbNullString
    End With
    Debug.Print Timer
End Sub
Refer to the in-code comments to follow the program flow. I seriously advise you to start with a smaller sample data set and work through the code using F8 and Ctrl+F8. Set some watches on the variables. I've left many Debug.Print statements in that can be uncommented and their output observed in the VBE's Immediate window.
By the way, my new workbooks are created with a single worksheet, not the default three. If your new workbooks still open with three blank worksheets, you might want to adjust the code to remove everything but a single blank Worksheet object immediately after the new target Workbook object is created. New worksheets are created to receive the data and are formatted appropriately on creation.
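If you need that adjustment, a sketch along these lines (mine, not part of the original answer) could be dropped in right after Set tpwb = Workbooks.Add to trim the new workbook down to one blank sheet:

    Application.DisplayAlerts = False               'suppress the "permanently delete" prompt
    Do While tpwb.Worksheets.Count > 1
        tpwb.Worksheets(tpwb.Worksheets.Count).Delete
    Loop
    Application.DisplayAlerts = True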
Results
While the results were produced quickly enough, I thought ~150K records (~135K after processing) were sufficient for testing. These results were split into multiple worksheets because of the iMAXROWS constant I set at 50K per worksheet.
TP_20160501_65489.xlsb
150K+150K processed into ~140K consolidated records in ~29 seconds.
You might also wish to seriously consider moving to a database solution.
See Highlight Duplicates and Filter by color alternative for pointers on dealing with large worksheets.
I've got a workbook containing a Summary sheet and 200 numbered sheets that the user fills in one after the other.
The following macro checks about 125 cell values on every numbered sheet, and fills in the Summary, one line per numbered sheet.
If a numbered sheet hasn't been used yet, the macro fills in every column from D to DV with the minus sign "-" and goes on to check every numbered sheet, one after the other, until there are no more to check.
Is there a way to set it up so that if an arbitrary number (let's say 10) of the newly generated lines contain only the minus sign "-" from D to DV (Iw, 4 to Iw, 126), the macro ends, since that would mean all the remaining numbered sheets aren't used yet?
Sub SummaryMacro()
    Dim Sh As Worksheet
    Range("B2:L1000").ClearContents
    Iw = 2 ' Index Write
    For Each Sh In ActiveWorkbook.Sheets
        If Sh.Name = "Summary" Then GoTo EndConsolidation
        Cells(Iw, 1).Select
        With Selection
            .Hyperlinks.Add Anchor:=Selection, Address:="", SubAddress:="'" & Sh.Name & "'" & "!" & "A1", TextToDisplay:="Go to"
        End With
        Cells(Iw, 2) = Sh.Name
        If Sh.Range("D8") = "" Then
            Cells(Iw, 3) = "-"
        Else
            Cells(Iw, 3) = Sh.Range("D8")
        End If
        'Here the rest of the process (Iw, 4 till Iw, 125)
        'The process also includes a few variations:
        'Something like 20 of those with various text
        If Sh.CheckBoxes("Check Box 1").Value = 1 Then Cells(Iw, 40) = "Declared" Else Cells(Iw, 40) = "-"
        'Something like 30 of those with various text
        If Sh.Range("H33") = "Issued" Then
            Cells(Iw, 42) = "-"
        Else
            Cells(Iw, 42) = Sh.Range("H33")
        End If
        'But all in all they are mostly like that
        If Sh.Range("C134") = "" Then
            Cells(Iw, 126) = "-"
        Else
            Cells(Iw, 126) = Sh.Range("C134")
        End If
        Iw = Iw + 1
EndConsolidation:
    Next Sh
End Sub
Try adding this code to your For loop at the end:
If (WorksheetFunction.CountIf(Range("D" & Iw & ":DV" & Iw), "-") = 123) Then
    Cntr = Cntr + 1 'Blank sheet found
Else
    Cntr = 0        'Not blank - Restart counter
End If
If (Cntr = 10) Then Exit For
This counts the number of "-" entries in your row; if it equals 123 (D through DV) it increments the counter, otherwise it resets the counter. When Cntr reaches 10 it exits the loop.
HTH
Add this code before your For loop ends:
'note: the declarations and the two initial assignments belong before the For loop,
'otherwise the counter is reset on every pass
Dim counter As Integer
Dim previousRowBlank As Boolean
counter = 0
previousRowBlank = True

'inside the loop: check whether all 123 cells from D to DV contain the "-" string
If (WorksheetFunction.CountIf(Sheets("Summary").Range("D" & Iw & ":DV" & Iw), "-") = 123) Then
    If (counter = 0) Then
        counter = counter + 1
        previousRowBlank = True
    Else
        If (previousRowBlank = True) Then
            counter = counter + 1
        End If
    End If
Else
    previousRowBlank = False
    counter = 0
End If

'assuming you want to exit when 10 consecutive rows are blank
If (counter = 10) Then
    Exit Sub
End If
When I have something like this I Dim a Boolean variable (perhaps call it isPopulated) which only gets switched to true when one of the cells has a value to act on. Then for your case after 10 (or however many you choose) lines, insert an If isPopulated = False Then Exit For to skip the remaining sheets.
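A minimal sketch of that pattern (the variable names and the single D8 check are my assumptions, not taken from your macro):

Sub SkipUnusedSheetsSketch()
    Dim Sh As Worksheet
    Dim isPopulated As Boolean
    Dim blankStreak As Long

    For Each Sh In ActiveWorkbook.Sheets
        If Sh.Name <> "Summary" Then
            'isPopulated only becomes True when a checked cell has a value to act on
            isPopulated = (Sh.Range("D8").Value <> "")      'example check; extend to your other cells
            If isPopulated Then
                blankStreak = 0
            Else
                blankStreak = blankStreak + 1
                If blankStreak >= 10 Then Exit For          'skip the remaining sheets
            End If
        End If
    Next Sh
End Sub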
EDIT: another idea I just had for you. If all the cells you're checking are supposed to have numeric values, then you could use the below:
If Not WorksheetFunction.Concat(Range("D8"), Range("C134"), etc) Like "*#*" Then
    'Code here to skip this and remaining sheets.
End If
Obviously you'd need to add the relevant ranges inside the Concat() brackets. What that will do is join the contents of those cells together, then check the result for any digits ("*#*"). Note that in a Like pattern "?" matches any single character, so a check specifically for letters would be something like "*[A-Za-z]*". That gives you a one-code-line answer to the basic question 'is this sheet populated or not'.
I'm sure it's a bad idea to terminate the macro prematurely based on such an imprecise criterion as the number of "empty" sheets in a row. If data starts again on the 11th, 15th or 30th sheet, you will not process it; you will lose it.
Your macro is not very complex; it shouldn't take longer than a few seconds to run. For modern Excel, 25K cells is not much.
Your code can be shortened and simplified a little. After all, you know all the addresses of the cells that you need to check on each sheet; you enter them into the macro code sequentially, right? Write them on one line, separated by commas, and put them in a constant.
After that, the whole code will become much shorter:
Sub SummaryMacro()
    Const REQUIRED_CELLS_ADDRESS As String = "D8,...<all other source cells>...,B6"
    Const SUM_SHEETNAME As String = "Summary"

    Dim ws As Worksheet
    Dim wsSum As Worksheet
    Dim rCell As Range
    Dim oTargetCell As Range
    Dim oSumCell As Range
    Dim aAddress As Variant
    Dim i As Integer

    aAddress = Split(REQUIRED_CELLS_ADDRESS, ",")
    Set wsSum = ActiveWorkbook.Worksheets(SUM_SHEETNAME)
    wsSum.UsedRange.Offset(1, 0).ClearContents
    Set oTargetCell = wsSum.Range("A1")
    For Each ws In ActiveWorkbook.Worksheets
        If ws.Name <> SUM_SHEETNAME Then
            Set oTargetCell = oTargetCell.Offset(1, 0)
            wsSum.Hyperlinks.Add Anchor:=oTargetCell, Address:="", SubAddress:="'" & ws.Name & "'" & "!" & "A1", TextToDisplay:="Go to"
            oTargetCell.Resize(1, 123).Value = "-"
            Set oSumCell = oTargetCell.Offset(0, 1)
            oSumCell.Value = ws.Name
            For i = LBound(aAddress) To UBound(aAddress)
                Set rCell = ws.Range(aAddress(i))
                Set oSumCell = oSumCell.Offset(0, 1)
                If Not IsEmpty(rCell) Then oSumCell.Value2 = rCell.Value2
            Next i
        End If
    Next ws
End Sub
Update: everyone knows that working with an array in RAM is much faster than working with sheet cells. Therefore the outer loop, iterating over the sheets of the workbook, remains the same, but we change the code inside the loop in this way:
Sub SummaryMacro()
    Const SUM_SHEETNAME As String = "Summary"

    Dim ws As Worksheet
    Dim wsSum As Worksheet
    Dim oTargetCell As Range
    Dim aResData As Variant

    Set wsSum = ActiveWorkbook.Worksheets(SUM_SHEETNAME)
    wsSum.UsedRange.Offset(1, 0).ClearContents
    Set oTargetCell = wsSum.Range("A1")
    For Each ws In ActiveWorkbook.Worksheets
        If ws.Name <> SUM_SHEETNAME Then
            Set oTargetCell = oTargetCell.Offset(1, 0)
            wsSum.Hyperlinks.Add Anchor:=oTargetCell, Address:="", SubAddress:="'" & ws.Name & "'" & "!" & "A1", TextToDisplay:="Go to " & ws.Name
            aResData = validateData(ws.Range("A1:L140").Value2) ' Or "D8:C134" or any other
            oTargetCell.Offset(0, 1).Resize(1, UBound(aResData) + 1).Value = aResData
        End If
    Next ws
End Sub
The main trick is hidden in this line: aResData = validateData(ws.Range("A1:L140").Value2)
We call our function and pass it, as a parameter, an array of the cell values from the entire sheet being processed. All further analysis and processing is carried out on the elements of this array. However, this is not the whole trick.
The validateData() function is very simple and looks like this:
Function validateData(aD As Variant) As Variant
    validateData = validateValues(aD(1, 5), aD(2, 8), aD(3, 1), aD(2, 11), _
                                  ........ , _
                                  aD(111, 3), aD(112, 8), aD(123, 9), aD(126, 10))
End Function
In other words, from the entire large array aD (the name is deliberately short, because in this function it has to be repeated 123 times) we select only those values that need to be analyzed and pass them on to the next function. Despite its seeming simplicity, this is the most time-consuming part to write: you need to find on the sheet all the cells "D8", "C134", etc. and write down their coordinates (row, column) as numbers, i.e. aD(8, 4), aD(134, 3), and so on.
The sheet's R1C1 display mode can perhaps help with this, or some kind of helper routine used while preparing the code (not while executing it; we agreed that we will no longer go back to the cells for values or for .Row and .Column properties!).
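As a design-time aid only (my own suggestion, not part of the answer's macros), a throwaway routine like this prints the aD coordinates for a list of addresses so they can be typed into validateData():

Sub PrintArrayCoordinates()
    'run once while writing the code; it is never called by the summary macro itself
    Dim adr As Variant
    For Each adr In Split("D8,H33,C134", ",")           'example addresses
        Debug.Print adr & " -> aD(" & Range(adr).Row & ", " & Range(adr).Column & ")"
    Next adr
End Sub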
What will the validateValues() function get? A long one-dimensional array aData(0 To 122) of cell values in the listed order; that is, as many values as there are cells to be filled in the Summary row for this sheet.
The last trick is the way the values are processed. It would seem that we have gained nothing from all these transformations, but you said there are three groups of checks: for an empty value, for a boolean value (checkbox) and for text strings. This is how it is handled:
Function validateValues(ParamArray aData() As Variant) As Variant
    Dim i As Variant
    Dim aResult As Variant
    ReDim aResult(LBound(aData) To UBound(aData))
    For i = LBound(aData) To UBound(aData)
        Select Case i
            Case 1, 5, 7, 9     ' Checking cells empty / value
                aResult(i) = IIf(aData(i) = "", "-", aData(i))
            Case 4, 6, 10       ' Checking cells boolean True / "not True" (False or blank)
                aResult(i) = IIf(aData(i), "Declared", "-")
            Case 0, 3, 8        ' Checking cells string "Issued" / other
                aResult(i) = IIf(aData(i) = "Issued", "-", aData(i))
            Case 2, 91, 118     ' Checking cells string "Pending" / other
                aResult(i) = IIf(aData(i) = "Pending", "-", aData(i))
            Case Else           ' In a real macro this branch is never reached because every array
                                ' position is listed above; it is useful only for debugging while
                                ' the conditions are still being written
                aResult(i) = "-"
                Debug.Print "Cell #" & i & " not processed yet"
        End Select
    Next i
    validateValues = aResult
End Function
And now - again, in just one call! - we write a whole row of results:
oTargetCell.Offset(0, 1).Resize(1, UBound(aResData) + 1).Value = aResData
I am sure these tricks will reduce the time it takes to build the summary sheet many times over. Please try this and let me know if it gets better.
Some background: each month I build a pivot table with approximately 30 business units (along the y axis); let's call them groups. Each group has a number of GL accounts that changes month to month. For example, Group 14 might have 10 GL accounts one month and only 3 the next. For each group we need the sum of the totals for the GL accounts that start with PL203000 and PL211010. We used to total these GL accounts for each group by hand; that has been solved with the code displayed below.
The code works perfectly when each group has more than one GL account (See pic 1)
The problem I am facing is when there is only one GL account, the code doesn't sum the correct amounts (see 2nd pic).
When digging into my code you can see that it sums the incorrect sections, since I use Range("E" & Rows.Count).End(xlUp) to establish the range. If there is only one GL account it skips to the next section, thereby building an incorrect formula.
Perhaps my code needs to be completely revamped in order to account for groups where there is only one GL account to sum? If so, what sort of If statement can I code so that it ignores groups that have only one GL account?
If not, then is the solution to have VBA count the range and, if it is less than 3, ignore the group and move on to the next?
'this section spits out the values needed to sum
For i = nRowMax To 4 Step -1
    If Left(Cells(i, 1), 8) = "PL211010" Or Left(Cells(i, 1), 8) = "PL203000" Then
        Cells(i, 4).Copy
        Cells(i, 5).PasteSpecial xlPasteValues
        Range(Cells(i, 1), Cells(i, 4)).Select
        With Selection.Interior
            .Pattern = xlSolid
            .PatternColorIndex = xlAutomatic
            .Color = 65535
            .TintAndShade = 0
            .PatternTintAndShade = 0
        End With
    End If
Next i
Application.CutCopyMode = False

'this section uses the values the first section specified to write the sum formula
'i believe the macro uses this section of code to write the first formula and the next section of code writes the formulas for the rest of the groups
Dim firstRow As Variant
Dim finalRow As Variant
finalRow = Range("E" & Rows.Count).End(xlUp).Row
firstRow = Cells(finalRow, 5).End(xlUp).Row
If IsNumeric(Cells(finalRow + 1, 5)) Then
    Cells(firstRow, 6).Formula = "=SUM(D" & firstRow & ":D" & finalRow & ")"
End If

'this section goes through the whole sheet to sum each group
For y = firstRow To 5 Step -1
    finalRow = Cells(y, 5).End(xlUp).Row
    firstRow = Cells(finalRow, 5).End(xlUp).Row
    If firstRow < 5 Then firstRow = 5
    If IsNumeric(Cells(finalRow + 1, 5)) Then
        Cells(firstRow, 6).Formula = "=SUM(D" & firstRow & ":D" & finalRow & ")"
    End If
    y = firstRow
    'If firstRow = 5 Then Exit Sub
Next y
If your dataset is an accurate enough example, you can scan through your business units and pick out only what you need. I have some example code here that will build up your sum range by using the Union function and applying that to the SUM formula when the entire business unit has been scanned. Of course, this is only an example that fits the data shown. You'll have to expand it to fit situations that are not visible to me.
To simplify the logic, I've separated the code into a function that will start scanning rows for a business unit and will stop when it reaches the end of the business unit -- the test I'm using for detecting the start of the next BU is a line that does not start with "PL". This may or may not be correct for all your data.
Because this code is checking each line and accumulating the sum range using the Union, if you only have one cell, you'll still get a formula that says =SUM($D$30) but it works.
Option Explicit

Sub test()
    Dim dataArea As Range
    Set dataArea = ActiveSheet.Range("A1")
    Do While Not IsEmpty(dataArea.Cells(1, 1))
        Set dataArea = AddSums(dataArea)
    Loop
End Sub

Private Function AddSums(ByRef businessUnitStart As Range) As Range
    '--- loops through cells following the 'Start' range given,
    '    and accumulates the range of accounts to summarize
    '    RETURNS the start of the next business unit range
    Dim accountRow As Range
    Dim account As String
    Set accountRow = businessUnitStart.Offset(1, 0)
    Dim sumArea As Range
    Do While Left$(accountRow.Cells(1, 1).Value2, 2) = "PL"
        account = accountRow.Cells(1, 1).Value2
        If (Left$(account, 8) = "PL211010") Or (Left$(account, 8) = "PL203000") Then
            '--- add this account to the sum formula
            If sumArea Is Nothing Then
                Set sumArea = accountRow.Cells(1, 4)
            Else
                Set sumArea = Union(sumArea, accountRow.Cells(1, 4))
            End If
        End If
        Set accountRow = accountRow.Offset(1, 0)
    Loop
    If Not sumArea Is Nothing Then
        Dim accountSum As Range
        Set accountSum = businessUnitStart.Offset(1, 6)
        accountSum.Formula = "=SUM(" & sumArea.Address & ")"
    End If
    Set AddSums = accountRow
End Function
My spreadsheet has about 800,000 rows with 30 columns. Customers are interested in duplicate values in only one column, but they need the entire row back. For example:
MemberId|Name|Address|CircleScore
H111|John Doe|123 W Main|2.4
H222|Jane Doe|124 W Main|3.2
H333|Bob Doe|125 W Main|2.5
H444|Jake Doe|126 W Main|2.1
H555|Mike Doe|127 W Main|2.4
They want the entire rows where there are duplicates in CircleScore. So my filtered excel should only contain:
MemberId|Name|Address|CircleScore
H111|John Doe|123 W Main|2.4
H555|Mike Doe|127 W Main|2.4
I tried highlighting duplicate CircleScore values and filtering, but the filtering part takes forever. I have waited for 15 minutes but still no luck. There could be around 150k duplicates.
Is there an alternative?
I would create an Is_Duplicated indicator column and use that to filter the duplicated CircleScores:
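The formula itself was shown as a screenshot in the original answer; a reconstruction of that kind of indicator (assuming CircleScore is in column D with a header row) is a helper column filled down with:

=COUNTIF($D$2:$D$800001,D2)>1

and then filtering the helper column for TRUE.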
UPDATE (per comments):
Alternatively, you can sort the CircleScore column and make the formula a bit less taxing on your system (NOTE CircleScore must be sorted beforehand):
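Again reconstructing the missing screenshot, a sorted-data version only needs to compare each row with its immediate neighbours, for example:

=OR(D2=D1,D2=D3)

which avoids the full-column COUNTIF and is therefore much less work per row.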
Please disregard this submission if you are a) getting paid by the hour and feel underpaid, b) planning on a nap while the routine processes, or c) both a) and b).
With any data set approaching 800K rows (with 30 columns) you are going to want to step into the variant array arena. With processing typically 5-7% of the time it takes to work with the worksheet values, it is very appropriate for large data blocks.
Anytime that the word 'duplicates' comes into play, I immediately start thinking about how a Scripting.Dictionary object's unique index on its Keys can benefit. In this solution I used a pair of dictionaries to identify the rows of data with a repeated Circle Score value.
Twenty-four million cells of data is a lot to store and transfer. Bulk methods beat individual methods every time and the bulkiest method of peeling off the data would be to stuff all 800K rows × 30 columns into a variant array. All processing becomes in-memory and the results are returned to a report worksheet en masse.
isolateDuplicateCircleScores code
Sub isolateDuplicateCircleScores()
    Dim d As Long, v As Long, csc As Long, stmp As String
    Dim ky As Variant, itm As Variant, vVALs As Variant, dCSs As Object, dDUPs As Object
    Dim w As Long, vWSs As Variant

    'early binding
    'dim dCSs As new scripting.dictionary, dDUPs As new scripting.dictionary

    appTGGL bTGGL:=False

    'late binding - not necessary with Early Binding (see footnote ¹)
    Set dCSs = CreateObject("Scripting.Dictionary")
    Set dDUPs = CreateObject("Scripting.Dictionary")
    'set to the defaults (not necessary)
    dCSs.comparemode = vbBinaryCompare
    dDUPs.comparemode = vbBinaryCompare

    'for testing on multiple row number scenarios
    'vWSs = Array("CircleScores_8K", "CircleScores_80K", "CircleScores_800K")
    'for runtime
    vWSs = Array("CircleScores") '<~~ your source worksheet here

    For w = LBound(vWSs) To UBound(vWSs)
        'ThisWorkbook.Save
        Debug.Print vWSs(w)
        Debug.Print Timer
        With Worksheets(vWSs(w))
            On Error Resume Next
            Worksheets(vWSs(w) & "_dupes").Delete
            On Error GoTo 0
            ReDim vVALs(0)
            dCSs.RemoveAll
            dDUPs.RemoveAll

            'prep a new worksheet to receive the duplicates
            .Cells(1, 1).CurrentRegion.Resize(2).Copy
            With Worksheets.Add(after:=Worksheets(.Index))
                .Name = vWSs(w) & "_dupes"
                With .Cells(1, 1)
                    .PasteSpecial Paste:=xlPasteAllUsingSourceTheme, Operation:=xlNone
                    .PasteSpecial Paste:=xlPasteColumnWidths, Operation:=xlNone
                    .Value = .Value2
                    .Offset(1, 0).EntireRow.ClearContents
                End With
            End With

            'finish prep with freeze row 1 and zoom to 80%
            With Application.Windows(1)
                .SplitColumn = 0
                .SplitRow = 1
                .FreezePanes = True
                .Zoom = 80
            End With

            'grab all of the data into a variant array
            ReDim vVALs(0)
            csc = Application.Match("CircleScore", .Rows(1), 0) 'CircleScore column number needed later
            vVALs = .Range(.Cells(2, 1), _
                           .Cells(.Cells(Rows.Count, csc).End(xlUp).Row, _
                           .Cells(1, Columns.Count).End(xlToLeft).Column)).Value2
            'Debug.Print LBound(vVALs, 1) & ":" & UBound(vVALs, 1)   '1:~800K
            'Debug.Print LBound(vVALs, 2) & ":" & UBound(vVALs, 2)   '1:~30
        End With 'done with the original worksheet

        'populate the dDUPs dictionary using the key index in dCSs
        For v = LBound(vVALs, 1) To UBound(vVALs, 1)
            If dCSs.exists(vVALs(v, csc)) Then
                stmp = vVALs(v, 1)
                For d = LBound(vVALs, 2) + 1 To UBound(vVALs, 2)
                    stmp = Join(Array(stmp, vVALs(v, d)), ChrW(8203))
                Next d
                dDUPs.Add Key:=v, Item:=stmp
                If Not dDUPs.exists(dCSs.Item(vVALs(v, csc))) Then
                    stmp = vVALs(dCSs.Item(vVALs(v, csc)), 1)
                    For d = LBound(vVALs, 2) + 1 To UBound(vVALs, 2)
                        stmp = Join(Array(stmp, vVALs(dCSs.Item(vVALs(v, csc)), d)), ChrW(8203))
                    Next d
                    dDUPs.Add Key:=dCSs.Item(vVALs(v, csc)), Item:=stmp
                End If
            Else
                dCSs.Item(vVALs(v, csc)) = v
            End If
        Next v

        'split the dDUPs dictionary items back into a variant array
        d = 1
        ReDim vVALs(1 To dDUPs.Count, 1 To UBound(vVALs, 2))
        For Each ky In dDUPs.keys
            itm = Split(dDUPs.Item(ky), ChrW(8203))
            For v = LBound(itm) To UBound(itm)
                vVALs(d, v + 1) = itm(v)
            Next v
            d = d + 1
        Next ky

        'put the values into the duplicates worksheet
        With Worksheets(vWSs(w) & "_dupes")
            .Cells(2, 1).Resize(UBound(vVALs, 1), UBound(vVALs, 2)) = vVALs
            With .Cells(1, 1).CurrentRegion
                With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0)
                    .Rows(1).Copy
                    .PasteSpecial Paste:=xlPasteFormats, Operation:=xlNone
                End With
                .Cells.Sort Key1:=.Columns(csc), Order1:=xlAscending, _
                            Key2:=.Columns(1), Order2:=xlAscending, _
                            Orientation:=xlTopToBottom, Header:=xlYes
            End With
        End With
        Debug.Print Timer
    Next w

    dCSs.RemoveAll: Set dCSs = Nothing
    dDUPs.RemoveAll: Set dDUPs = Nothing

    appTGGL
End Sub
Public Sub appTGGL(Optional bTGGL As Boolean = True)
    With Application
        .ScreenUpdating = bTGGL
        .EnableEvents = bTGGL
        .DisplayAlerts = bTGGL
        .AutoRecover.Enabled = bTGGL 'no interruptions with an auto-save
        .Calculation = IIf(bTGGL, xlCalculationAutomatic, xlCalculationManual)
        .CutCopyMode = False
        .StatusBar = vbNullString
    End With
    Debug.Print Timer
End Sub
Sample Data and Results
800K rows × 30 columns of random sample data
~123K rows × 30 columns of duplicate rows (sorted and formatted in about a minute-and-a-half)
Timed Results
To be honest, I never got the 32-bit version of Excel on the older laptop to run the 800K pass more than once without restarting Excel. Once restarted, the results were consistent with what is shown. The 64-bit Excel ran repeatedly without a hiccup.
Large Worksheet Addendum
When dealing with worksheets containing large data blocks there are a few general improvements that can limit your wait times. You're using Excel as a medium sized database tool so treat the data worksheet as the raw data that it should be.
If you are not working with a 64-bit version of Excel then you are wasting time with everything you do. See What version of Office am I using? and Choose the 32-bit or 64-bit version of Office.
Save as an Excel Binary Workbook (e.g. .XLSB). The file size is typically 25-35% of the original. Load times are improved and some calculation is quicker (sorry, do not have empirical timed data on the latter). Some operations that crash an .XLSX or .XLSM work fine with an .XLSB.
Disable Auto-Save/Auto-Recover in the options for the workbook. ([alt]+F, T, S, [alt]+D, [OK]). There are few things more irritating than waiting for an auto-save to complete when you are trying to do something. Get used to Ctrl+S when YOU want to save.
Avoid volatile functions² at all costs, particularly in formulas that are used in the full scope of the data. A single TODAY() in a COUNTIF formula filled down for the extent of the rows will have you sitting on your thumb more often than not.
Speaking of formulas, revert all formulas to their result values whenever possible.
Merged cells, conditional formatting, data validation and making cells look pretty with formatting and styles slows you down. Minimize the use of anything that takes away from raw data. It isn't like anyone is actually going to look through 800K rows of data.
After removing data, use Home ► Editing ► Clear ► Clear All on the vacant cells. Tapping Del only clears the contents and may not reset the Worksheet.UsedRange property; Clear All will facilitate resetting the .UsedRange on the next save.
If you have hooped your computer with one or more Excel [Not Responding] scenarios, reboot your machine. Excel never fully recovers from these and simply restarting Excel to start over is slower and more likely to enter the same Not Responding condition later.
¹ To convert the late binding of the Scripting.Dictionary to early binding, you must add Microsoft Scripting Runtime to the VBE's Tools ► References.
² Volatile functions recalculate whenever anything in the entire workbook changes, not just when something that affects their outcome changes. Examples of volatile functions are INDIRECT, OFFSET, TODAY, NOW, RAND and RANDBETWEEN. Some sub-functions of the CELL and INFO worksheet functions will make them volatile as well.
Try this VBA code (and learn a little bit of Dutch):
Sub DuplicatesInColumn()
    'builds a list of the duplicates that were found
    Dim LaatsteRij As Long
    Dim MatchNr As Long
    Dim iRij, iKolom, iTeller, Teller As Long, ControlKolom As Long

    iRij = 1
    iKolom = 5          'number of columns in the sheet, change if not correct
    ControlKolom = 4    'column number where to find the duplicates, change if not correct
    LaatsteRij = Cells(65000, iKolom).End(xlUp).Row: iTeller = iKolom
    Sheet1.Activate
    For iRij = 1 To LaatsteRij
        If Cells(iRij, ControlKolom) <> "" Then
            MatchNr = WorksheetFunction.Match(Cells(iRij, ControlKolom), Range(Cells(1, ControlKolom), Cells(LaatsteRij, ControlKolom)), 0)
            If iRij <> MatchNr Then
                iTeller = iKolom
                For Teller = 1 To iTeller
                    Cells(iRij, iKolom + Teller).Offset(0, 2).Value = Range(Cells(iRij, Teller), Cells(iRij, Teller)).Value
                Next Teller
            End If
        End If
    Next
End Sub
I have a data set with names and addresses in an Excel file in the following format.
Name1
134/47/1,
adrs1, adr2, country
Name2
adrs1, adrs2, country
Name3
107/c,
adrs3, adrs3, country
etc…
I want to split this data into multiple rows in the following format:
Name1
134/47/1,
adrs1,
adrs2,
country
Name2
No 134/63,
adrs1,
adrs2,
country
etc…
I tried the following, but it worked for only one cell.
Sub tst()
    Dim X As Variant
    X = Split(Range("A1").Value, ",")
    Range("A1").Resize(UBound(X) - LBound(X) + 1).Value = Application.Transpose(X)
End Sub
The following macro might help you. You would have to select the very last cell in your table containing a multipart address. When you start the macro it will then work its way up to the top and insert address lines where needed (only in the current column) and then exit.
Option Base 1

Sub trnsfrm()
    Dim i%, n%, ret(3, 1)
    Set r = Selection
    Do
        a = Split(r, ",")
        ret(1, 1) = Trim(a(0))
        ret(2, 1) = Trim(a(1))
        ret(3, 1) = Trim(a(2))
        r.Range([a2], [a3]).Insert Shift:=xlDown
        r.Range([a1], [a3]) = ret
        If r.Row <= 4 Then Exit Do
        Set r = r.Offset(-4)
    Loop
End Sub
If you want to insert lines across the whole table you should replace the line
r.Range([a2], [a3]).Insert Shift:=xlDown
by
r.Range([a2], [a3]).EntireRow.Insert Shift:=xlDown
Assumptions / Warning
Since the macro will actually change your current table and 'undo' does not work with macros you should definitely save everything before you try it.
The macro assumes that each address block consists of exactly 4 lines. If there are fewer or more lines to an address the macro will get out of sync and will very likely output garbage or halt.
I'm not sure whether your sample data had trailing commas on single values as a typo or whether that accurately represents your data, but either way it should be accounted for. A rogue comma as a suffix will create an extra element in the variant array, thereby throwing off dimensions derived from the UBound function.
Sub split_from_below_space()
    Dim rw As Long, v As Long, vVALs As Variant

    With Worksheets("Sheet1") 'set this worksheet reference properly!
        For rw = .Cells(Rows.Count, 1).End(xlUp).Row To 1 Step -1
            .Cells(rw, 1) = Trim(.Cells(rw, 1).Value2)
            If CBool(InStr(1, .Cells(rw, 1).Value2, Chr(44) & Chr(32))) Then
                vVALs = Split(.Cells(rw, 1).Value2, Chr(44) & Chr(32))
                .Cells(rw + 1, 1).Resize(UBound(vVALs), 1).EntireRow.Insert
                .Cells(rw, 1).Resize(UBound(vVALs) + 1, 1) = _
                    Application.Transpose(vVALs)
                For v = UBound(vVALs) - 1 To LBound(vVALs) Step -1
                    .Cells(rw, 1).Offset(v, 0) = _
                        Trim(.Cells(rw, 1).Offset(v, 0).Value2) & Chr(44)
                Next v
            End If
        Next rw
    End With
End Sub
You will need to insert rows to accommodate the data and that method is almost always (as in this case) better performed by working from the bottom to the top.
So this problem has been presented to me and it's becoming a huge roadblock in the production of my website. I'm not new to Excel when it comes to the interface, but writing functions is something I have never had to deal with. I have a table with values labeled by reference number that basically relay a form that was filled out by a certain provider. The column with all the different answers to the fields (yes, it has different field answers in one column, sigh) needs to be broken up so I can label them with column headers in order to eventually import them into an SQL database. The source is currently delivered in this format:
What I need to do is be able to fill out a column-based version of these values that looks like:
The criteria for creating columns are based on the values in A (ref #), B, C and D. I'm guessing I need to create some sort of conditional statement that checks whether C and D are equal to a certain value (C and D designate the type of information that is in E, so they are pretty much my key elements/conditionals) and then places the information from cell E underneath the correct column header. I have been researching functions such as VLOOKUP/MATCH/INDEX and I can't make much sense of how to apply them, or whether there is a better function I can use to accomplish my task. Even a reference to a relevant SO thread would be great at this point. I basically just need some guidance as to what it would take to make this work. On top of that, the reference numbers ascend but are not in any particular order, so I am wondering if it is possible to feed a function a list of reference numbers to increment to once all the conditionals have been run through for a particular reference number.
EDIT: Ok so here is my new issue -->
The images as you requested
Original Data: http://imgur.com/htvzqNU
After VBA Script: http://imgur.com/cDQQxE6
This is the only code we edited:
vHDRs = Array(Array("Reference #", -1, -2), _
Array("Provider Name", 300, 100), _
Array("Provider Number", 300, 300), _
Array("County", 200, 400), _
Array("Address", 100, 100), _
Array("Zip", 200, 300))
As you can see, the column for addresses does not populate
Here is a fairly standard VBA sub with enough safeties that it shouldn't destroy anything of substance.
Sub My_Organize()
    Dim rw As Long, v As Long, vHDRs As Variant
    Dim i As Long, j As Long, iREFNO As Long, iREFROW As Long, iLR As Long
    Dim ws As Worksheet, app As Application

    Set app = Application
    app.ScreenUpdating = False
    app.EnableEvents = False
    app.DisplayAlerts = False
    app.Calculation = xlCalculationManual

    On Error Resume Next
    Worksheets("Organized").Delete
    On Error GoTo Safe_Exit
    Worksheets.Add(after:=Sheets(Sheets.Count)).Name = "Organized"
    Set ws = Sheets(Sheets.Count)

    vHDRs = Array(Array("Reference #", -1, -2), _
                  Array("Provider Name", 4200, 100), _
                  Array("Phone #", 4300, 100))
    ws.Cells(1, 1).Resize(1, UBound(vHDRs) + 1) = app.Transpose(app.Index(vHDRs, , 1))

    With Sheet1
        iLR = .Cells(Rows.Count, 1).End(xlUp).Row
        With .Cells(1, 1).CurrentRegion
            .Cells.Sort Key1:=.Columns(1), Order1:=xlAscending, _
                        Key2:=.Columns(3), Order2:=xlAscending, _
                        Key3:=.Columns(4), Order3:=xlAscending, _
                        Orientation:=xlTopToBottom, Header:=xlYes
            For rw = 2 To iLR
                If iREFNO <> .Cells(rw, 1).Value2 Then
                    iREFNO = .Cells(rw, 1).Value2
                    iREFROW = ws.Cells(Rows.Count, 1).End(xlUp).Row + 1
                    ws.Cells(iREFROW, 1) = iREFNO
                End If
                For i = LBound(vHDRs, 1) To UBound(vHDRs, 1)
                    If .Cells(rw, 3).Value2 = vHDRs(i)(1) And _
                       .Cells(rw, 4).Value2 = vHDRs(i)(2) Then
                        ws.Cells(iREFROW, i + 1) = .Cells(rw, 5).Value2
                        Exit For
                    End If
                Next i
            Next rw
        End With
    End With

Safe_Exit:
    Set ws = Nothing
    app.Calculation = xlCalculationAutomatic
    app.DisplayAlerts = True
    app.EnableEvents = True
    app.ScreenUpdating = True
    Set app = Nothing
End Sub
Edit the nested array of vHDRs information to match what you want to collect and transpose from the source worksheet. Just add a new nested array into that and change the label and the numbers to match from columns C & D. They do not have to be in any special order in the outer array but each inner array should be label, column C, column D.
With your data pasted into a new workbook's Sheet1, run that routine against it. It will create a new worksheet at the end of the queue and transpose the data according to the parameters you set up in the array of column header labels, and two other numbers to match from columns C and D on the source worksheet (i.e. Sheet1).
If you run that repeatedly against 23M rows (in multiple worksheets) then the values could be bulk fed into an array so that all processing would be done in memory.
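As a sketch of that idea (my own illustration; the A:E range and the variable names are assumptions), the source rows could be read once into a variant array and the matching loop run against the array instead of the cells:

Sub My_Organize_InMemory_Sketch()
    'sketch: pull the source block (columns A:E here, an assumption) into memory once
    Dim vSrc As Variant, rw As Long
    With Sheet1
        vSrc = .Range(.Cells(2, 1), .Cells(.Cells(Rows.Count, 1).End(xlUp).Row, 5)).Value2
    End With
    Debug.Print "rows loaded into memory: " & UBound(vSrc, 1)
    For rw = LBound(vSrc, 1) To UBound(vSrc, 1)
        'vSrc(rw, 1) = reference #, vSrc(rw, 3) and vSrc(rw, 4) = the C/D keys, vSrc(rw, 5) = the value;
        'run the same matching logic as My_Organize here, accumulating results in a second array
        'that is written to the "Organized" sheet in a single assignment at the end
    Next rw
End Sub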