My Excel crashes when I scrape a website for information multiple times and insert it into cells
I already included Set IE = Nothing and IE.Quit in my code, but it doesn't change the fact that the code returns an error after multiple iterations.
My code consists of a loop-part and the actual scraping. Here is the loop:
Public Sub LooperForMMDescription()
Dim currentValue As String
Dim dataList As Variant
Dim i As Integer
Dim n As Integer
Dim FirstRow As Integer
Dim IE As Object
n = 1
Set dataList = Range("Table6")
FirstRow = Range("Table6").Row - 1
'On Error Resume Next
Set IE = Nothing
For i = 1 To UBound(dataList.Value)
If IsEmpty(dataList.Value) Then
Exit Sub
Else
currentValue = dataList(i, 1).Text
If Len(currentValue) = 0 Then
GoTo ByPass
End If
Call MM_description(currentValue, n, FirstRow, IE)
ByPass:
n = n + 1
End If
Next i
Sheets("Input").Range("F7").Select
End Sub
And this is the actual scraping:
Public Sub MM_description(currentValue As String, n As Integer, FirstRow As Integer, IE As Object)
Dim html As HTMLDocument
Dim codeLine As String
Dim startPos As Long
Dim endPos As Long
Set IE = Nothing
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = False
IE.Navigate2 (currentValue)
Do While IE.Busy
Application.Wait DateAdd("s", 1, Now)
Loop
mes = IE.document.body.innerHTML
startPos = InStr(mes, "Description") + 61
endPos = InStr(mes, "Address")
If startPos = 0 Then
Sheets("Input").Range("F" & FirstRow + n).Value = "Not Found"
Else
codeLine = Mid(mes, startPos, endPos - startPos - 229)
Sheets("Input").Range("F" & FirstRow + n).Value = codeLine
End If
IE.Quit
Set IE = Nothing
End Sub
The code runs fine for 80-90 iterations, but then it returns an error
So, this is more of a code review than an answer. The following are notes on your code and a suggested re-write.
Use Long, not Integer, as this reduces the risk of overflow, which can happen with the Integer datatype particularly when dealing with loops over rows (there are more rows in a worksheet than Integer can handle). Additionally, there is no performance benefit here from Integer over Long.
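A quick demonstration of the limits (a throwaway sub; the worksheet row count is the usual culprit):
Sub IntegerOverflowDemo()
    Dim i As Integer
    i = 32767            'the largest value an Integer can hold
    'i = i + 1           'uncommenting this raises run-time error 6 (Overflow)
    Dim r As Long
    r = Rows.Count       '1,048,576 rows in a modern worksheet; needs a Long
    Debug.Print i, r
End Sub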
Camelcase local variables
firstRow
Improve readability by using worksheet variables
Dim ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
Use explicit sheet references, not bug-prone implicit ActiveSheet references. Using the ws variable from above:
Range("Table6")
which has implicit Activesheet reference can have an explicit sheet reference
ws.Range("Table6")
dataList.Value is a 2D array, as you are reading in a range from a worksheet:
For i = 1 To UBound(dataList.Value)
So there should be a second dimension specified in your loop, and it would be more efficient to read that 2D array into a variable once rather than incurring the repeated, expensive I/O of going out to the sheet for each value.
I don't know what your Table6 looks like, but I suspect you are attempting to loop a specific column (likely the first).
You could instead put the table into a variable and then read its first column values (excluding the header) into a 1D array to loop. As you will later be writing values back out to the sheet, dimension an output array with the same dimensions as the array you are looping, to store the results of your loop:
Dim arr(), table As ListObject, output()
Set table = ws.ListObjects("Table6")
arr = Application.Transpose(table.ListColumns(1).DataBodyRange.Value)
ReDim output(1 To UBound(arr))
This
If IsEmpty(dataList.Value) Then
Exit Sub
Else
is basically looking at whether the table DataBodyRange is empty. Assuming you are checking whether there are any urls in column 1 of your table, then this test is only needed once, before the loop, and can be a one-liner without the If Else End If:
If IsEmpty(arr) Then Exit Sub
Consider renaming local variables to more useful/descriptive names: currentValue to currentUrl, as this is more informative IMO.
This
If Len(currentValue) = 0 Then
GoTo ByPass
End If
is basically checking whether there is a value to pass as a url, and using GoTo to handle the case where there isn't. Avoid GoTo where possible as it makes code harder to read; it isn't needed here. You can use a quick vbNullString comparison, or even better InStr(url, "http") > 0, to validate the value you will be working with (I have switched from currentValue to currentUrl):
'initial code
If currentUrl <> vbNullString Then 'test
'call the sub and extract value
End If
n = n + 1 'increment....loop....rinse....repeat
Alternative validation:
If instr(currentUrl, "http") > 0 Then 'test
'call the sub and extract value
End If
n = n + 1 'increment....loop....rinse....repeat
As you already have a loop variable of i then n isn't really needed at all. Particularly in light of populating an output array at same indices.
ie is already Nothing when you have Dim ie As Object, so the Set ie = Nothing is redundant. You want to instantiate the object once at the start:
Set ie = CreateObject("InternetExplorer.Application")
Then work with that instance throughout your loop. You already include ie in your scraping sub signature, so it is expected that you will pass the same instance around:
Public Sub MM_description(currentValue As String, n As Long, firstRow As Long, ie As Object)
Add ByRef, ByVal to signature
Public Sub MM_description(ByVal currentValue As String, ByVal n As Long, ByVal firstRow As Long, ByVal ie As Object)
Remove the redundant Call keyword when calling the sub, and remove the () as this is a sub called with parameters:
Call MM_description(currentValue, n, firstRow, ie) becomes MM_description currentValue, n, firstRow, ie
As you are passing ie to the sub MM_description, you don't want to then dereference it and instantiate a new instance inside the called sub. So, remove
Set ie = Nothing
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = False
from inside MM_description
Inside the called sub:
Remove the () from
ie.Navigate2 (currentUrl)
So
ie.Navigate2 currentUrl
and use a proper page load wait. So replace:
Do While ie.Busy
Application.Wait DateAdd("s", 1, Now)
Loop
with
While .Busy Or .readyState <> 4: DoEvents: Wend
Remove unused variables, e.g. Dim html As HTMLDocument, and declare all others that are used, e.g. Dim mes As String. Put Option Explicit at the top of your module to check for consistency of variable spellings and declarations.
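For example, with Option Explicit at the top of the module, a typo such as mse for mes becomes a compile-time error instead of silently creating a new Variant:
Option Explicit

Sub SpellingDemo()
    Dim mes As String
    mes = "abc"
    'mse = "abc"    'uncommented, this fails to compile: "Variable not defined"
End Sub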
Now, I would actually convert this sub, MM_description, into a function that returns the scraped string value, or "Not Found", and populates the output array in the same loop which calls the function.
If this is now a function the signature needs a return type specified, the call to the function needs an assignment and the () come back as there is evaluation.
output(i) = MM_description(currentUrl, i, ie)
Finally, write out the output array to whichever range you want the output values in one go.
Worksheets("Input").Range("F1").Resize(UBound(output), 1) = Application.Transpose(output)
Many of the above changes would lead to a structure like:
Option Explicit
Public Sub LooperForMMDescription()
Dim currentUrl As String, i As Long
Dim ie As Object, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
Dim arr(), table As ListObject, output()
Set table = ws.ListObjects("Table6")
arr = Application.Transpose(table.ListColumns(1).DataBodyRange.Value)
If IsEmpty(arr) Then Exit Sub
ReDim output(1 To UBound(arr))
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
For i = LBound(arr) To UBound(arr)
currentUrl = arr(i)
If InStr(currentUrl, "http") > 0 Then 'test
'call the sub and extract value
output(i) = MM_description(currentUrl, i, ie)
End If
Next i
ie.Quit
ThisWorkbook.Worksheets("Input").Range("F1").Resize(UBound(output), 1) = Application.Transpose(output)
End Sub
Public Function MM_description(ByVal currentUrl As String, ByVal i As Long, ByVal ie As Object) As String
Dim codeLine As String, startPos As Long, endPos As Long, mes As String
With ie
.Navigate2 currentUrl
While .Busy Or .readyState < 4: DoEvents: Wend
mes = .document.body.innerHTML
startPos = InStr(mes, "Description")
endPos = InStr(mes, "Address")
If startPos = 0 Or endPos = 0 Then
MM_description = "Not Found"
Else
startPos = startPos + 61
codeLine = Mid$(mes, startPos, endPos - startPos - 229)
MM_description = codeLine
End If
End With
End Function
I am trying to update my first code with parts of a second code, but have been struggling for a few weeks now. Both codes extract emails from a URL list.
THE FIRST CODE
This code is fine apart from the fact that it uses mailto: rather than a regex. I am trying to replace the mailto: check in the first code with the regex from the second code, as the regex extracts more emails.
How the first code works:
A list of URLs is placed in Sheet2 "Urls" and the results are shown in Sheet1 "Results". This code will extract ALL emails from a site: if there are 10 emails it will extract all 10, if 100 then all 100. The SECOND code only extracts 1 email per site.
The problem with the FIRST CODE is that the mailto: approach MISSES a lot of emails, whereas the regex captures more, and so I want to update the first code. However, I have been struggling for a few weeks now as I am not super hot at writing code. The first code was originally written by me and then updated by another developer, and his code is way out of my skill depth. I have been trying for weeks to update it but cannot seem to work anything out, so I decided to post.
I have listed the first code below. You can download a sample workbook from the "Download Link" for the first code.
The second code, which I also wrote, and the email regex I am trying to use are linked in my StackOverflow post.
FIRST CODE, Bulk Email Extractor. ONLY THE EMAIL PART NEEDS UPDATING WITH THE REGEX.
Sub ScrapeSoMeAndMailAddresses()
'Columns for both tables
Const colUrl As Long = 1 'Must always be the first column
Const colMail As Long = 2 'Must always be the first column before Some platforms
Const colFacebook As Long = 3 'Must always be the last column of Some platforms
Const colError As Long = 4 'Must always be the last column
Dim url As String
Dim http As Object
Dim htmlDoc As Object
Dim nodeAllLinks As Object
Dim nodeOneLink As Object
Dim pageLoadSuccessful As Boolean
Dim tableUrlsOneAddressLeft As String
Dim tableAllAddresses As String
Dim currentRowTableUrls As Long
Dim lastRowTableUrls As Long
Dim currentRowsTableAll(colUrl To colFacebook) As Long
Dim lastRowTableAll As Long
Dim addressCounters(colMail To colFacebook) As Long
Dim checkCounters As Long
'Initialize variables
tableUrlsOneAddressLeft = "Urls" ''Name of Sheet
currentRowTableUrls = 2 'First row for content
tableAllAddresses = "Results" ''Name of Sheet
For checkCounters = colUrl To colFacebook
currentRowsTableAll(checkCounters) = 2 'First rows for content
Next checkCounters
Set htmlDoc = CreateObject("htmlfile")
Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
'Clear all contents and comments in the URL source sheet from email column to error column
With Sheets(tableUrlsOneAddressLeft)
lastRowTableUrls = .Cells(Rows.Count, colUrl).End(xlUp).Row
.Range(.Cells(currentRowTableUrls, colMail), .Cells(lastRowTableUrls, colError)).ClearContents
.Range(.Cells(currentRowTableUrls, colMail), .Cells(lastRowTableUrls, colError)).ClearComments
End With
'Delete all rows except headline in the sheet with all addresses
lastRowTableAll = Sheets(tableAllAddresses).Cells(Rows.Count, colUrl).End(xlUp).Row
Sheets(tableAllAddresses).Rows(currentRowsTableAll(colUrl) & ":" & lastRowTableAll).Delete Shift:=xlUp
'Loop over all URLs in column A in the URL source sheet
Do While Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, 1).Value <> ""
'Scroll for visual monitoring, if the sheet with the URLs is the active one
If ActiveSheet.Name = tableUrlsOneAddressLeft Then
If currentRowTableUrls > 14 Then
ActiveWindow.SmallScroll down:=1
End If
Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, 1).Select
End If
'Get next url from the URL source sheet
url = Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, colUrl).Value
'Try to load page. Temporarily disable error handling in case there is a timeout or another error
On Error Resume Next
http.Open "GET", url, False
http.send
'Check if page loading was successful
If Err.Number = 0 Then
pageLoadSuccessful = True
End If
On Error GoTo 0
If pageLoadSuccessful Then
'Build html document for DOM operations
htmlDoc.body.innerHtml = http.responseText
'Create node list from all links of the page
Set nodeAllLinks = htmlDoc.getElementsByTagName("a")
'Walk through all links of the node list
For Each nodeOneLink In nodeAllLinks
'''#####################################################################################################
'''################################### THIS IS THE START OF THE EMAIL SECTION ##########################
'''#####################################################################################################
'Check for mail address
If InStr(1, nodeOneLink.href, "mailto:") Then
'Write mail address to both tables
Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, colMail).Value = Right(nodeOneLink.href, Len(nodeOneLink.href) - InStr(nodeOneLink.href, ":"))
Sheets(tableAllAddresses).Cells(currentRowsTableAll(colMail), colMail).Value = Right(nodeOneLink.href, Len(nodeOneLink.href) - InStr(nodeOneLink.href, ":"))
'Check if it is a new line in the sheet with all addresses
If currentRowsTableAll(colMail) >= currentRowsTableAll(colUrl) Then
'Write URL in the new line of the sheet with all addresses
Sheets(tableAllAddresses).Cells(currentRowsTableAll(colUrl), colUrl).Value = url
'Increment url counter
currentRowsTableAll(colUrl) = currentRowsTableAll(colUrl) + 1
End If
'Increment mail counters
currentRowsTableAll(colMail) = currentRowsTableAll(colMail) + 1
addressCounters(colMail) = addressCounters(colMail) + 1
End If
'''#####################################################################################################
'''################################### END OF THE EMAIL SECTION ########################################
'''#####################################################################################################
'Check for Facebook address
If InStr(1, UCase(nodeOneLink.href), "FACEBOOK") Then
'Write Facebook address to both tables
Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, colFacebook).Value = nodeOneLink.href
Sheets(tableAllAddresses).Cells(currentRowsTableAll(colFacebook), colFacebook).Value = nodeOneLink.href
'Check if it is a new line in the sheet with all addresses
If currentRowsTableAll(colFacebook) >= currentRowsTableAll(colUrl) Then
'Write URL in the new line of the sheet with all addresses
Sheets(tableAllAddresses).Cells(currentRowsTableAll(colUrl), colUrl).Value = url
'Increment url counter
currentRowsTableAll(colUrl) = currentRowsTableAll(colUrl) + 1
End If
'Increment Facebook counters
currentRowsTableAll(colFacebook) = currentRowsTableAll(colFacebook) + 1
addressCounters(colFacebook) = addressCounters(colFacebook) + 1
End If
Next nodeOneLink
'Check address counters
For checkCounters = colMail To colFacebook
'Set comment if more than 1 link were found
If addressCounters(checkCounters) > 1 Then
Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, checkCounters).AddComment Text:=CStr(addressCounters(checkCounters))
Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, checkCounters).Comment.Shape.TextFrame.AutoSize = True
End If
Next checkCounters
Else
'Page not loaded
'Write message URL table
Sheets(tableUrlsOneAddressLeft).Cells(currentRowTableUrls, colError).Value = "Error with URL or timeout"
End If
'Prepare for next page
pageLoadSuccessful = False
Erase addressCounters
lastRowTableAll = Sheets(tableAllAddresses).Cells(Rows.Count, colUrl).End(xlUp).Row
For checkCounters = colUrl To colFacebook
currentRowsTableAll(checkCounters) = lastRowTableAll + 1 'First rows for next page content
Next checkCounters
currentRowTableUrls = currentRowTableUrls + 1
Loop
'Clean up
Set http = Nothing
Set htmlDoc = Nothing
Set nodeAllLinks = Nothing
Set nodeOneLink = Nothing
End Sub
As always thanks in advance.
I recommend you save a copy of your workbook before running the code in my answer. Hopefully it works and does what you need it to, I wasn't able to test it fully.
The GetEmailAddressesFromHtml function in the code below extracts email addresses using the regular expression you've included in your post.
I think one thing the code below doesn't do (which your code did) is add comments to the Urls worksheet. But you could set up an Excel formula (e.g. COUNTIFS) to do that instead, as sketched below.
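For example, a sketch assuming the URLs sit in column A of both sheets with headers in row 1: put this in an unused column of the Urls sheet (e.g. E2) and fill down to count how many Results rows each URL produced:
=COUNTIFS(Results!$A:$A,$A2)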
I only add unique Facebook URLs and email addresses, so you shouldn't see any duplicates on the Results sheet.
Option Explicit
Option Private Module 'This option means that Subs of this module are not displayed in the macros of the Excel GUI
' https://learn.microsoft.com/en-us/office/vba/language/reference/user-interface-help/raise-method
Private Const ERR_REQUEST_FAILED As Long = 513
Private Const ERR_INVALID_HTML As Long = 514
Private Sub PrepareSourceSheet( _
ByVal someSheet As Worksheet, ByVal firstRowToClear As Long, ByVal lastRowToClear As Long, _
ByVal firstColumnToClear As Long, ByVal lastColumnToClear As Long)
' Should clear all contents and comments in the source sheet from email column to error column
With someSheet
With .Range(.Cells(firstRowToClear, firstColumnToClear), .Cells(lastRowToClear, lastColumnToClear))
Debug.Assert Intersect(.Cells, .Parent.Columns(1)) Is Nothing
.ClearContents
.ClearComments
End With
End With
End Sub
Private Sub PrepareDestinationSheet(ByVal someSheet As Worksheet, ByVal firstRowToDelete As Long)
'Should delete all rows starting from and including "firstRowToDelete".
With someSheet
.rows(firstRowToDelete & ":" & .rows.CountLarge).Delete Shift:=xlUp
End With
End Sub
Private Sub ScrapeSomeData()
'Columns for both tables
Const COLUMN_URL As Long = 1 'Must always be the first column
Const COLUMN_EMAIL As Long = 2 'Must always be the first column before Some platforms
Const COLUMN_FACEBOOK As Long = 3 'Must always be the last column of Some platforms
Const COLUMN_ERROR As Long = 4 'Must always be the last column
Const FIRST_SOURCE_ROW As Long = 2 ' Skip headers
Const FIRST_DESTINATION_ROW As Long = 2 ' Skip headers
Dim sourceSheet As Worksheet
Set sourceSheet = ThisWorkbook.Worksheets("Urls")
Dim destinationSheet As Worksheet
Set destinationSheet = ThisWorkbook.Worksheets("Results")
Dim lastSourceRow As Long
lastSourceRow = sourceSheet.Cells(sourceSheet.rows.Count, COLUMN_URL).End(xlUp).Row
PrepareSourceSheet someSheet:=sourceSheet, firstRowToClear:=FIRST_SOURCE_ROW, _
lastRowToClear:=lastSourceRow, firstColumnToClear:=COLUMN_EMAIL, lastColumnToClear:=COLUMN_ERROR
PrepareDestinationSheet someSheet:=destinationSheet, firstRowToDelete:=FIRST_DESTINATION_ROW
Dim destinationRowIndex As Long
destinationRowIndex = FIRST_DESTINATION_ROW
Dim sourceRowIndex As Long
For sourceRowIndex = FIRST_SOURCE_ROW To lastSourceRow
Dim data As Collection
Set data = GetDataForUrl(sourceSheet.Cells(sourceRowIndex, COLUMN_URL))
With destinationSheet
Dim currentRowData As Variant
For Each currentRowData In data
.Cells(destinationRowIndex, COLUMN_URL).Value = currentRowData("url")
.Cells(destinationRowIndex, COLUMN_EMAIL).Value = currentRowData("emailAddress")
.Cells(destinationRowIndex, COLUMN_FACEBOOK).Value = currentRowData("facebookUrl")
.Cells(destinationRowIndex, COLUMN_ERROR).Value = currentRowData("errorMessage")
destinationRowIndex = destinationRowIndex + 1
Next currentRowData
End With
With sourceSheet
.Cells(sourceRowIndex, COLUMN_EMAIL).Value = data(1)("emailAddress")
.Cells(sourceRowIndex, COLUMN_FACEBOOK).Value = data(1)("facebookUrl")
.Cells(sourceRowIndex, COLUMN_ERROR).Value = data(1)("errorMessage")
End With
DoEvents
Next sourceRowIndex
End Sub
Private Function GetHtmlFromUrl(ByVal someUrl As String) As Object
' Should return a HTML document. Raises an error if URL is unavailable
' (at the time of requesting) or if HTML could not be assigned.
Dim httpClient As Object
Set httpClient = CreateObject("MSXML2.ServerXMLHTTP.6.0")
Dim succeeded As Boolean
httpClient.Open "GET", someUrl, False
On Error Resume Next
httpClient.Send
succeeded = (0 = Err.Number)
On Error GoTo 0
If Not succeeded Then Err.Raise ERR_REQUEST_FAILED, , "Error with URL or timeout"
Dim htmlDocument As Object
Set htmlDocument = CreateObject("htmlfile")
On Error Resume Next
htmlDocument.body.innerHTML = httpClient.responseText
succeeded = (0 = Err.Number)
On Error GoTo 0
If Not succeeded Then Err.Raise ERR_INVALID_HTML, , "Error whilst assigning HTML"
Set GetHtmlFromUrl = htmlDocument
End Function
Private Function GetFacebookUrlsFromHtml(ByVal htmlDocument As Object) As Collection
' Should return a collection of strings that are Facebook URLs detected.
' This function only looks within anchor tags.
Dim outputCollection As Collection
Set outputCollection = New Collection
Dim allAnchorTags As Object
Set allAnchorTags = htmlDocument.getElementsByTagName("a")
Dim anchorTag As Object
For Each anchorTag In allAnchorTags
If InStr(1, UCase$(anchorTag.href), "FACEBOOK", vbBinaryCompare) > 0 Then
On Error Resume Next
outputCollection.Add anchorTag.href, Key:=anchorTag.href ' De-duplicate here
On Error GoTo 0
End If
Next anchorTag
Set GetFacebookUrlsFromHtml = outputCollection
End Function
Private Function GetEmailAddressesFromHtml(ByVal htmlDocument As Object) As Collection
' Should return a collection of strings representing email addresses detected
' in the HTML document.
Dim outputCollection As Collection
Set outputCollection = New Collection
Dim regEx As Object
Set regEx = CreateObject("VBScript.RegExp")
With regEx
.Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
.Global = True
Dim emailMatches As Object
Set emailMatches = .Execute(htmlDocument.body.innerHTML)
End With
Dim matchFound As Object
For Each matchFound In emailMatches
On Error Resume Next ' De-duplicate here.
outputCollection.Add matchFound.Value, Key:=matchFound.Value
On Error GoTo 0
Next matchFound
Set GetEmailAddressesFromHtml = outputCollection
End Function
Private Function CreateRowDataForError(ByVal wasSuccess As Boolean, ByVal errorMessage As String, ByVal someUrl As String) As Collection
' Context: An error has occurred and we don't have any data (Facebook URLs, email addresses).
' So can only return the URL attempted and the error message.
Dim nestedCollection As Collection
Set nestedCollection = New Collection
nestedCollection.Add wasSuccess, "wasSuccess"
nestedCollection.Add errorMessage, "errorMessage"
nestedCollection.Add someUrl, "url"
nestedCollection.Add vbNullString, "facebookUrl"
nestedCollection.Add vbNullString, "emailAddress"
Set CreateRowDataForError = New Collection
CreateRowDataForError.Add nestedCollection
Debug.Assert 1 = CreateRowDataForError.Count
End Function
Private Function CreateRowDataForResults(ByVal wasSuccess As Boolean, ByVal errorMessage As String, _
ByVal someUrl As String, ByVal facebookUrls As Collection, ByVal emailAddresses As Collection) As Collection
' Context: No error occurred. HTML document may or may not contain data,
' but logic below should handle both scenarios.
Dim nestedCollection As Collection
Dim outerCollection As Collection
Set outerCollection = New Collection
Dim i As Long
For i = 1 To Application.Max(1, facebookUrls.Count, emailAddresses.Count)
Set nestedCollection = New Collection
nestedCollection.Add wasSuccess, Key:="wasSuccess"
nestedCollection.Add errorMessage, Key:="errorMessage"
nestedCollection.Add someUrl, Key:="url"
nestedCollection.Add GetCollectionItemOrDefault(facebookUrls, i, vbNullString), Key:="facebookUrl"
nestedCollection.Add GetCollectionItemOrDefault(emailAddresses, i, vbNullString), Key:="emailAddress"
outerCollection.Add nestedCollection
Next i
Debug.Assert outerCollection.Count = Application.Max(1, facebookUrls.Count, emailAddresses.Count)
Set CreateRowDataForResults = outerCollection
End Function
Private Function GetDataForUrl(ByVal someUrl As String) As Collection
' Currently this function misuses Collection class. Should probably instead write a class to return a custom object/data structure.
' Returns a collection of nested collections, where each nested collection is as below:
' • "wasSuccess" = whether data was successfully retrieved
' • "errorMessage" = an error message mentioning what happened
' • "facebookUrl" = a Facebook URL detected
' • "emailAddress" = an email address detected
Dim wasSuccess As Boolean
Dim errorMessage As String
Dim htmlDocument As Object
On Error Resume Next
Set htmlDocument = GetHtmlFromUrl(someUrl)
wasSuccess = (0 = Err.Number)
If Not wasSuccess Then
errorMessage = IIf(ERR_REQUEST_FAILED = Err.Number Or ERR_INVALID_HTML = Err.Number, Err.Description, "Unexpected error occurred")
End If
On Error GoTo 0
If Not wasSuccess Then
Set GetDataForUrl = CreateRowDataForError(wasSuccess, errorMessage, someUrl)
Else
Dim facebookUrls As Collection
Set facebookUrls = GetFacebookUrlsFromHtml(htmlDocument)
Dim emailAddresses As Collection
Set emailAddresses = GetEmailAddressesFromHtml(htmlDocument)
Set GetDataForUrl = CreateRowDataForResults(wasSuccess, errorMessage, someUrl, facebookUrls, emailAddresses)
End If
End Function
Private Function GetCollectionItemOrDefault(ByVal someCollection As Collection, ByVal someKey As Variant, ByVal someDefaultValue As Variant) As Variant
' Assumes item is not an object. This function will return false negatives if item being retrieved is an object.
Dim succeeded As Boolean
On Error Resume Next
GetCollectionItemOrDefault = someCollection(someKey)
succeeded = (0 = Err.Number)
On Error GoTo 0
If Not succeeded Then GetCollectionItemOrDefault = someDefaultValue
End Function
From a maintenance and coding perspective, I think something like Node.js or Python would allow you to get the same work done in fewer lines of code.
I wrote a macro to go to WU to get historical data and for the most part, it works. However, I believe that the macro is running too fast for it to pick up the data from the website.
https://www.wunderground.com/history/daily/us/tx/el-paso/KELP/date/2017-1-3
That is the website, and the table I want to get is tablesaw-sortable.
I have tried the following: DoEvents and Application.Wait (Now + TimeValue("00:00:01")) to try to slow down the process.
Sub BrowseToWU()
Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As MSHTML.HTMLDocument
Dim RowAddress As Integer
Dim WebAddress As String
Dim DateSheet As Date
Dim WkDay As Integer
Dim DateSheetName As String
'Application.ScreenUpdating = False
'Application.StatusBar = True
RowAddress = 2
IE.Visible = True
Do Until RowAddress = 60
WebAddress = Range("A" & RowAddress)
DateSheet = Right(WebAddress, 8)
DateSheetName = Right(WebAddress, 8)
WkDay = Weekday(DateSheet, vbSunday)
If WkDay < 3 Then
RowAddress = RowAddress + 1
ElseIf WkDay > 6 Then
RowAddress = RowAddress + 1
Else
IE.Navigate WebAddress
Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop
Set HTMLDoc = IE.Document
DoEvents
Application.Wait (Now + TimeValue("00:00:05"))
DoEvents
ProcessHTMLPage HTMLDoc
DateSheet = Right(WebAddress, 8)
DoEvents
Application.Wait (Now + TimeValue("00:00:01"))
ActiveSheet.Name = DateSheetName
DoEvents
RowAddress = RowAddress + 1
'IE.Quit
Worksheets("Sheet1").Activate
End If
Loop
End Sub
Option Explicit
Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)
Dim HTMLTable As MSHTML.IHTMLElement
Dim HTMLTables As MSHTML.IHTMLElementCollection
Dim HTMLRow As MSHTML.IHTMLElement
Dim HTMLCell As MSHTML.IHTMLElement
Dim RowNum As Long, ColNum As Integer
'Dim IE As New SHDocVw.InternetExplorer
'Dim Ws As Worksheet
Set HTMLTables = HTMLPage.getElementsByClassName("tablesaw-sortable")
'DoEvents
For Each HTMLTable In HTMLTables
Worksheets.Add
DoEvents
Range("A1").Value = HTMLTable.className
Range("B1").Value = Now
RowNum = 2
For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
'Debug.Print vbTab & HTMLRow.innerText
ColNum = 1
For Each HTMLCell In HTMLRow.Children
Cells(RowNum, ColNum) = HTMLCell.innerText
ColNum = ColNum + 1
Next HTMLCell
RowNum = RowNum + 1
Next HTMLRow
Next HTMLTable
DoEvents
'IE.Quit
End Sub
The macro is supposed to run through sheet1 picking up the web address to the historical data if it satisfies the criteria of being a certain day of the week.
IE will open and then it will kick over to the next module that will take in the data.
A new worksheet is created and the data pasted into the new worksheet.
The worksheet is renamed to the date of the data.
The web address sheet is activated again and the process starts over again.
The error I get is that the data isn't taken from the website, so the For statement ends, the web address sheet is renamed, and an error occurs.
One way around this is to call the API that the page is using to get that info.
The API returns json which you can parse with a json parser. I use jsonconverter.bas. After installing the code from that link in a standard module called JsonConverter, go to VBE > Tools > References > Add a reference to Microsoft Scripting Runtime.
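As a minimal sketch of how JsonConverter is used once installed (the JSON literal here is made up purely for illustration):
Sub JsonConverterDemo()
    Dim parsed As Object
    Set parsed = JsonConverter.ParseJson("{""observations"":[{""temp"":52}]}")
    Debug.Print parsed("observations")(1)("temp")   'prints 52; collections are 1-based
End Sub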
Finding the API:
If you press F12 to open developer tools and go to the Network tab and then press F5 to refresh any url of interest you will see the recorded web traffic. You can find the API call there.
See my answer here on how to search the network traffic using a specific observation value you expect to see in the response - this will filter the list of network traffic to those items containing the value of interest. Be sensible in selecting the value - you want something unlikely to occur elsewhere. You can also filter the network traffic to XHR only.
The API response:
The API returns json. More specifically, it returns a dictionary containing 2 keys. The second key, "observations", can be used to return a collection (denoted by []) of dictionaries (denoted by {}).
Each dictionary represents a row of the table (daily observations). You can loop this collection, and then loop the inner dictionaries, to access the table row values and reconstruct the table by populating an array. Explore example json response here.
Explanation of json structure:
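In place of the structure screenshot, the shape is roughly as follows (only the keys the code below actually uses are named; everything else is elided rather than guessed):
{
  ...,                                <- first key of the outer dictionary (not used here)
  "observations": [                   <- a collection [] of dictionaries {}
    { "valid_time_gmt": ..., "expire_time_gmt": ..., ... },
    { ... }                           <- one dictionary per table row
  ]
}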
Explanation of code:
The code is broken down into a number of helper subs and functions, allocating certain tasks to each, to make the code easier to debug and follow, as well as to better align with Object Oriented Programming principles.
Overall the process is:
Gather urls from Worksheet("Sheet1"). Helper function GetAllUrls.
Process those urls and only retain the dates which correspond with Tue-Thur. These are kept as strings formatted as "yyyymmdd" so they can be passed to the API later. This is handled by helper functions GetOnlyQualifyingUrlsDates and IncludeThisDate. IncludeThisDate performs the check for whether to include; GetOnlyQualifyingUrlsDates handles the looping and formatting of results.
Issue xmlhttp requests by looping over qualifying url dates and concatenating those into the url for the API call, then issuing the request. This is performed by the main sub GetTables.
Sheet creation, for output, is handled by helper function CreateWorksheet. This function calls another helper function, SheetExists, to ensure sheets are only created if they don't already exist, otherwise, the existing sheet by that name is used.
The resultant json response, from step 3, is passed to a helper sub WriteOutResults which accepts the json variable and the output sheet object as arguments. It extracts all the info from the json response; essentially reconstructing the table. It adds the table and headers to the appropriate sheet.
It calls helper function Epoch2Date, which handles the unix timestamp to datetime conversion for the two unix fields in the json object.
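A quick sanity check of that conversion, using the Epoch2Date function from the VBA below (values chosen so the arithmetic is easy to verify):
'In the Immediate window:
'? Epoch2Date(0)        -> 01/01/1970
'? Epoch2Date(86400)    -> 02/01/1970 (one day is 86,400 seconds)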
TODO:
The API key may be time limited. Add a helper function which returns the current valid key.
The API accepts start date and end date parameters in the url construct. It would be far better to issue one request for the entire range if possible, or chunked ranges e.g. months, to reduce the number of requests made. This would also reduce the likelihood of being blocked. This would mean some additional code would need to be written, before writing out results, to ensure only dates of interest are being written to sheets. Though you could write out all then simply loop all sheets and delete those that aren't wanted (perfectly doable if we are talking about 365 dates total). Personally, I would handle the include date part in the construction of the table from a single request (if possible) that has the min and max dates for entire urls listed passed as start and end date parameters. I would then write a single flat table out to one sheet as this will be much easier for later data analysis.
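A minimal sketch of that single-request idea, assuming the API accepts a wider startDate/endDate span in one call (an untested assumption; the key, coordinates and parameter names are the ones already used in GetTables below, and the dates are examples):
Public Sub GetRangeInOneRequest()
    Dim json As Object, url As String
    'One request for the whole span instead of one per qualifying day
    url = "https://api.weather.com/v1/geocode/31.76/-106.49/observations/historical.json" & _
          "?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=20170101&endDate=20170131&units=e"
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .send
        Set json = JsonConverter.ParseJson(.responseText)("observations")
    End With
    'json now holds every observation in the range; filter on weekday before writing out
End Sub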
VBA:
Option Explicit
Public Sub GetTables()
'VBE > Tools > References > Microsoft Scripting Runtime
Dim json As Object, qualifyingUrlsDates(), urls(), url As String
Dim ws As Worksheet, wsOutput As Worksheet, i As Long, startDate As String, endDate As String
Set ws = ThisWorkbook.Worksheets("Sheet1")
urls = GetAllUrls(2, ws, "A")
qualifyingUrlsDates = GetOnlyQualifyingUrlsDates(urls)
'API key may not be valid over time so look at obtaining it via a prior request
With CreateObject("MSXML2.XMLHTTP") 'issue xmlhttp request for each valid date (this would be better done using start and enddate to specify entire range _
of batches e.g. months within total range to cut down on requests
For i = LBound(qualifyingUrlsDates) To UBound(qualifyingUrlsDates)
startDate = qualifyingUrlsDates(i)
endDate = startDate ' a little verbose but useful for explaining
url = "https://api.weather.com/v1/geocode/31.76/-106.49/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=" & startDate & "&endDate=" & endDate & "&units=e"
.Open "GET", url, False
.send
Set json = JsonConverter.ParseJson(.responseText)("observations")
Set wsOutput = CreateWorksheet(qualifyingUrlsDates(i))
WriteOutResults wsOutput, json
Next
End With
End Sub
Public Sub WriteOutResults(ByVal wsOutput As Worksheet, ByVal json As Object)
'json is a collection of dictionaries. Each dictionary is a time period reading from the day i.e. one row in output
Dim results(), item As Object, headers(), r As Long, c As Long, key As Variant
headers = json.item(1).keys 'get the headers which are the keys of each dictionary
ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
For Each item In json
r = r + 1: c = 0 'increase row in results array to store results for table row
For Each key In item.keys
c = c + 1 'increase column number in results array for writing out results
Select Case key
Case "valid_time_gmt", "expire_time_gmt" 'convert unix timestamp fields to datetime
results(r, c) = Epoch2Date(item(key))
Case Else
results(r, c) = item(key)
End Select
Next
Next
With wsOutput
.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Public Function GetOnlyQualifyingUrlsDates(ByVal urls As Variant) As Variant
Dim i As Long, output(), counter As Long
ReDim output(1 To UBound(urls))
For i = LBound(urls) To UBound(urls)
If IncludeThisDate(urls(i)) Then 'check if weekday is to be included
counter = counter + 1
output(counter) = Format$(Right$(urls(i), 8), "yyyymmdd") 'if to include then add to output array of urls of interest
End If
Next
ReDim Preserve output(1 To counter)
GetOnlyQualifyingUrlsDates = output
End Function
Public Function IncludeThisDate(ByVal url As String) As Boolean
'tue, wed, thurs are valid
IncludeThisDate = Not IsError(Application.Match(Weekday(Right$(url, 8), vbSunday), Array(3, 4, 5)))
End Function
Public Function SheetExists(ByVal sheetName As String) As Boolean '<== function by #Rory
SheetExists = Evaluate("ISREF('" & sheetName & "'!A1)")
End Function
Public Function GetAllUrls(ByVal startRow As Long, ByVal ws As Worksheet, ByVal columnName As String) As Variant
'transpose used based on premise no more than a couple of years of dates
'startRow is start row for urls, ws is sheet where urls found, columnName is string representation of column for urls e.g. "A"
With ws
GetAllUrls = Application.Transpose(.Range(columnName & startRow & ":" & columnName & .Cells(.rows.Count, columnName).End(xlUp).Row).Value)
End With
End Function
Public Function CreateWorksheet(ByVal sheetName As String) As Worksheet
Dim ws As Worksheet
If SheetExists(sheetName) Then
Set ws = ThisWorkbook.Worksheets(sheetName)
'do something.... clear it? Then add new data to it?
Else
Set ws = ThisWorkbook.Worksheets.Add
ws.Name = sheetName
End If
Set CreateWorksheet = ws
End Function
Public Function Epoch2Date(ByVal E As Currency, Optional msFrac) As Date '# Schmidt http://www.vbforums.com/showthread.php?805245-EPOCH-to-Date-and-vice-versa
Const Estart As Double = #1/1/1970#
msFrac = 0
If E > 10000000000# Then E = E * 0.001: msFrac = E - Int(E)
Epoch2Date = Estart + (E - msFrac) / 86400
End Function
Background
Disclaimer: I am a beginner, so please bear with my (most plausibly wrong) code.
I want to update currency pairs' value (PREV CLOSE) with a button-enabled VBA macro. My Excel worksheet contains FX pairs (e.g. USDGBP) in column G:G, which are then used to run a FOR loop for every pair in the column.
The value would then be stored in column I:I.
Right now, the problem according to the debugger lies in one line of code that I will highlight below.
Sources
I got some inspiration from https://www.youtube.com/watch?v=JxmRjh-S2Ms&t=1050s - notably 17:34 onwards - but I want my code to work for multiple websites at the press of a button.
I have tried the following code
Public Sub Auto_FX_update_BMG()
Application.ScreenUpdating = False 'My computer is not very fast, thus I use this line of
'code to save some computing power and time
Dim internet_object As InternetExplorer
Dim i As Integer
For i = 3 To Sheets(1).Cells(3, 7).End(xlDown).Row
FX_Pair = Sheets(1).Cells(i, 7)
Set internet_object = New InternetExplorer
internet_object.Visible = True
internet_object.navigate "https://www.bloomberg.com/quote/" & FX_Pair & ":CUR"
Application.Wait Now + TimeValue("00:00:05")
internet_object.document.getElementsByClassName("class")(0).getElementsByTagName ("value__b93f12ea") '--> DEBUGGER PROBLEM
'My goal here is to "grab" the PREV CLOSE
'value from the website
With ActiveSheet
.Range(Cells(i, 9)).Value = HTML_element.Children(0).textContent
End With
Sheets(1).Range(Cells(i, 9)).Copy 'Not sure if these 2 lines are unnecesary
ActiveSheet.Paste
Next i
Application.ScreenUpdating = True
End Sub
Expected Result
WHEN I enter "USDGBP" in a cell in column G:G, the macro would go to https://www.bloomberg.com/quote/EURGBP:CUR and "grab" the PREV CLOSE value of 0.8732 (using today's value) and insert it in the respective row of column I:I.
As of now, I am just facing the debugger without much idea on how to solve the problem.
You can use class selectors in a loop. The pattern
.previousclosingpriceonetradingdayago .value__b93f12ea
specifies to get child elements with class value__b93f12ea having parent with class previousclosingpriceonetradingdayago. The "." in front is a css class selector and is a faster way of selecting as modern browsers are optimized for css. The space between the two classes is a descendant combinator. querySelector returns the first match for this pattern from the webpage html document.
This matches the following on the page (screenshot omitted); you can see the parent-child relationship and the classes:
<section class="dataBox previousclosingpriceonetradingdayago numeric">
<header class="title__49417cb9"><span>Prev Close</span></header>
<div class="value__b93f12ea">0.8732</div>
</section>
N.B. If you are a Bloomberg customer look into their APIs. Additionally, it is very likely you can get this same info from other dedicated APIs which will allow for much faster and more reliable xhr requests.
VBA (Internet Explorer):
Option Explicit
Public Sub test()
Dim pairs(), ws As Worksheet, i As Long, ie As Object
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set ie = CreateObject("InternetExplorer.Application")
With ws
pairs = Application.Transpose(.Range("G2:G" & .Cells(.rows.Count, "G").End(xlUp).Row).Value) ' assumes pairs start in row 2
End With
Dim results()
ReDim results(1 To UBound(pairs))
With ie
.Visible = True
For i = LBound(pairs) To UBound(pairs)
.Navigate2 "https://www.bloomberg.com/quote/" & pairs(i) & ":CUR", False
While .Busy Or .readyState < 4: DoEvents: Wend
results(i) = .document.querySelector(".previousclosingpriceonetradingdayago .value__b93f12ea").innerText
Next
.Quit
End With
ws.Cells(2, "I").Resize(UBound(results), 1) = Application.Transpose(results)
End Sub
For very limited numbers of requests (more than that leads to blocking) you could use an xhr request and regex out the value. I assume pairs are in sheet one and start from G2. I also assume there are no empty cells or invalid pairs in column G up to and including the last pair to search for. Otherwise, you will need to develop the code to handle this.
Try the regex here: https://regex101.com/r/OAyq30/1
Option Explicit
Public Sub test()
Dim re As Object, pairs(), ws As Worksheet, i As Long, s As String
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set re = CreateObject("VBScript.RegExp")
With ws
pairs = Application.Transpose(.Range("G2:G" & .Cells(.rows.Count, "G").End(xlUp).Row).Value) ' assumes pairs start in row 2
End With
Dim results()
ReDim results(1 To UBound(pairs))
With CreateObject("MSXML2.XMLHTTP")
For i = LBound(pairs) To UBound(pairs)
.Open "GET", "https://www.bloomberg.com/quote/" & pairs(i) & ":CUR", False
.send
s = .responseText
results(i) = GetCloseValue(re, s, "previousClosingPriceOneTradingDayAgo%22%3A(.*?)%2")
Next
End With
ws.Cells(2, "I").Resize(UBound(results), 1) = Application.Transpose(results)
End Sub
Public Function GetCloseValue(ByVal re As Object, inputString As String, ByVal pattern As String) As String 'https://regex101.com/r/OAyq30/1
With re
.Global = True
.MultiLine = True
.IgnoreCase = False
.pattern = pattern
If .test(inputString) Then
GetCloseValue = .Execute(inputString)(0).SubMatches(0)
Else
GetCloseValue = "Not found"
End If
End With
End Function
Try the code below:
But first make sure to add 2 references by going to Tools > References, then look for Microsoft HTML Object Library and Microsoft Internet Controls.
This code works using your example.
Sub getPrevCloseValue()
Dim ie As Object
Dim mySh As Worksheet
Set mySh = ThisWorkbook.Sheets("Sheet1")
Dim colG_Value As String
Dim prev_value As String
Dim a As Long
Dim sect As Object
For a = 3 To mySh.Range("G" & Rows.Count).End(xlUp).Row
colG_Value = mySh.Range("G" & a).Value
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
ie.navigate "https://www.bloomberg.com/quote/" & colG_Value & ":CUR"
Do While ie.Busy: DoEvents: Loop
Do Until ie.readyState = 4: DoEvents: Loop
'Application.Wait (Now + TimeValue("00:00:03")) 'activate if having problem with delay
For Each sect In ie.document.getElementsByTagName("section")
If sect.className = "dataBox previousclosingpriceonetradingdayago numeric" Then
prev_value = sect.getElementsByTagName("div")(0).innerText
mySh.Range("I" & a).Value = prev_value
Exit For
End If
Next sect
Next a
End Sub
I have a video tutorial for basic web automation using vba which include web data scraping and other commands, please check the link below:
https://www.youtube.com/watch?v=jejwXID4OH4&t=700s
The code works fine, but I need it to extract ONLY emails and URLs and place the email in Sheet1 "Scraper" NEXT BLANK ROW
Emails = Column A
Urls = Column B
Currently it extracts any text, emails, or URLs and places them in column A or B.
I only need emails or URLs. I have been stuck on this for some time and can't seem to work it out.
Also I am not sure if my DELETE DUPLICATES is deleting duplicate rows or duplicates within a column. It SHOULD be duplicate rows.
How the code works:
On Sheet2 "Url List" I have a list of URLs; the code runs through this, places the results onto Sheet1 "Scraper", and deletes any duplicates.
It is only supposed to scrape emails and URLs and place them in columns A and B on the NEXT BLANK ROW.
I have tried to fix the problem but it is out of my scope.
Private Sub fbStart_Click()
'Set sheet2 URL List and open Internet Explorer
Dim lr As Long
Dim x As Long
Dim arr() As Variant
Dim wks As Worksheet
Dim ie As Object
Dim dd(1 To 2) As String
Dim Fr As Long
On Error Resume Next
Application.ScreenUpdating = False
Set wks = ThisWorkbook.Sheets("Url List")
With wks
Fr = .Cells(.Rows.Count, 6).End(xlUp).Offset(1).Row
lr = .Cells(.Rows.Count, 1).End(xlUp).Row
.Cells(1, 5).Value = lr
arr = .Range(.Cells(Fr, 1), .Cells(lr, 1)).Value
End With
'Show Internet Explorer and add delay in seconds if needed
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
Application.Wait Now + TimeValue("0:00:0")
For x = LBound(arr, 1) To UBound(arr, 1)
.navigate arr(x, 1)
wtime = Time
Do While .Busy Or .readyState <> 4
DoEvents
'Skip pages with Captchas + write the word Captcha in Sheet 2 Column C
If Time > (wtime + TimeValue("00:00:10")) Then
Cells(x + 1, "C").Value = "Captcha"
Exit Do
End If
Loop
On Error Resume Next
'Variable for document or data which need to be extracted out of webpage, change innertext number if same class used
Dim doc As HTMLDocument
Set doc = ie.document
dd(1) = doc.getElementsByClassName("_50f4")(2).innerText
dd(2) = doc.getElementsByClassName("_50f4")(3).innerText
'Paste the web data into Sheet1 "Scraper" in next BLANK ROW
With Sheet1
.Cells(.Rows.Count, "A").End(xlUp).Offset(1, 0).Resize(, 2).Value = dd
End With
' Put A number 1 in Sheet2 "Url List"column B to notify this URL is done
Sheets("Url List").Range("B" & Rows.Count).End(xlUp).Offset(1, 0).Value = 1
'Deletes duplicates in column A Sheet1
Columns(1).RemoveDuplicates Columns:=Array(1)
Columns(2).RemoveDuplicates Columns:=Array(1)
'Count No1 in sheet2 Column B
With Worksheets("Url List")
Lastrow = .Cells(.Rows.Count, "B").End(xlUp).Row
Sheets("Url List").Range("B1").Value = Lastrow
End With
Call Autoclick_Click
Next x
.Quit
End With
'Hide FaceBook Scraper Form
ScraperForm.Hide
End Sub
Below shows how to handle finding the email and website address. You already have your loop and de-duplicate step. Below are helper methods for extracting the required info. You can simply assign from the variables email and website to your cells in the loop. I show a method using a helper function to determine the last row in the target sheet and writing out the variables to the correct columns in one go.
I can help with implementing the loop integration if needed, but the emphasis here was on explaining what could be done for identifying those elements of interest and how to write out to the correct columns. Tbh, de-duplicating is easily done in the sheet at the end, but you can also use the macro recorder to get perfectly functional code for that single step, or use existing SO answers; a sketch follows.
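As a hedged sketch of that final de-duplication step (assuming the results sit in columns A:B of the Scraper sheet with a header row; Columns:=Array(1, 2) is what makes it treat the email/URL pair as one row):
Sub DedupeScraperRows()
    With ThisWorkbook.Worksheets("Scraper")
        'A row is removed only if BOTH the email (col A) and URL (col B) match an earlier row
        .Range("A1").CurrentRegion.RemoveDuplicates Columns:=Array(1, 2), Header:=xlYes
    End With
End Sub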
tl;dr;
This would be a lot easier if :contains / :has css pseudo classes were permitted. Instead, my approach is as follows:
email - find the href attribute whose value starts with mailto
website - check that there is a website icon on the page
Specify a parent of both the website icon and the website address
Loop all matches to that parent specification, checking whether each contains the website icon (this is where pseudo class selectors would have simplified things). If a match is found then we have the shared parent of both the icon and, hopefully, the website address; use the childOfSiblingCssSelector css selector (we are looking at a child of the following div in this case) to extract the website url.
Notes:
The entire thing is kept quite high level/generic such that you can adjust your css selectors to hopefully cater for different scenarios. Consequence - may seem a little verbose.
Helper functions are provided to handle element matching. Name these in a way that makes sense for what they are doing. I think some room for improvement here.
Whilst technically the second helper, GetText, could be used to extract the email address (I'd probably add another argument to the function call to specify the attribute to extract) as well as the website address, it seems far quicker, currently, to simply target the appropriate href as detailed above.
I have kept the css selectors as local variables close to their usage; you could have them as constants, closer to top of module, where easier to access perhaps? Unsure without seeing how this performs over time/different urls
Css selectors are chosen over .getElementsBy methods for two reasons: 1) there is browser optimization for css selectors so, if well formulated, css will be faster 2) I want to preserve the flexibility of the code/helper functions - you have far more specificity with css selectors in terms of what patterns you can express. I deemed this important as I don't know what future cases you may need to handle.
I am deliberately not using class name and index, e.g. doc.getElementsByClassName("_50f4")(2).innerText, as I am unfamiliar with the range of potential use cases; this feels a little fragile as it relies on consistent ordering and numbering of elements (at least up to these indices). See the comparison after these notes.
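To make that trade-off concrete, the two styles side by side (doc is an assumed HTMLDocument variable; the class names are the ones used elsewhere in this answer):
Dim el As Object
Set el = doc.getElementsByClassName("_50f4")(2)   'fragile: relies on consistent ordering/index
Set el = doc.querySelector("._5aj7 ._50f4")       'expresses the structural relationship instead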
TODO:
Rather than instantiate a new HTMLDocument each time in GetText, it would be more efficient to pass another HTMLDocument argument in the function signature, i.e. from the calling procedure. A re-factor could take that into consideration (a sketch follows this list).
This type of coding might lend itself to being class based in the future. Particularly if error handling is to be added and further functions.
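A sketch of the first re-factor (the changed signature is my assumption, not part of the original answer): the calling procedure owns a single scratch HTMLDocument and GetText reuses it on every call.
Public Function GetTextReusingDoc(ByVal scratch As HTMLDocument, ByVal parents As Object, _
    ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
    Dim i As Long
    For i = 0 To parents.Length - 1
        scratch.body.innerHTML = parents.item(i).innerHTML   'reuse rather than New HTMLDocument each call
        If scratch.querySelectorAll(iconCssSelector).Length > 0 Then
            GetTextReusingDoc = scratch.querySelector(childOfSiblingCssSelector).innerText
            Exit Function
        End If
    Next
    GetTextReusingDoc = "Not found"
End Function
'Caller creates the scratch document once: Set scratch = New HTMLDocument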
VBA:
Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
Dim ie As Object, ws As Worksheet
Set ie = CreateObject("InternetExplorer.Application")
Set ws = ThisWorkbook.Worksheets("Scraper")
With ie
.Visible = True
.Navigate2 "https://www.facebook.com/pg/SalemFordNH/about/?ref=page_internal%5Blink%5D"
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
Dim email As String, website As String, iconCssSelector As String
'iconCssSelector for website icon in this instance
iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"
If ElementIsPresent(ie.document, "[href^=mailto]") Then
email = ie.document.querySelector("[href^=mailto]").innerText
Else
email = "Not found"
End If
Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent
If ElementIsPresent(ie.document, iconCssSelector) _
And ElementIsPresent(ie.document, sharedParentCssSelector) Then
Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
Else
website = "Not found"
End If
End With
'Assumes headers already present
Dim nextRow As Long
nextRow = GetLastRow(ws, 1) + 1
ws.Cells(nextRow, 1).Resize(1, 2) = Array(email, website)
.Quit
End With
End Sub
Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function
Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
'In this instance, and with the Microsoft IE DOM, you cannot select for the parent of an element with the pseudo
'class :has(>child), nor use :contains. Instead, pass the expected parent selector, that houses
'both the icon element for the website and the website address itself, and loop all matches checking for the website icon;
'if found, use childOfSiblingCssSelector to extract the website address.
Dim i As Long, html As HTMLDocument
Set html = New HTMLDocument
For i = 0 To parents.length - 1
html.body.innerHTML = parents.item(i).innerHTML
If ElementIsPresent(html, iconCssSelector) Then
GetText = html.querySelector(childOfSiblingCssSelector).innerText
Exit Function
End If
Next
GetText = "Not found"
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function
Project references (VBE > Tools > References):
Microsoft HTML Object Library
Additional reading:
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll
https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector
Edit:
Example of loop - assumes no empty rows in column A between urls.
Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
Dim ie As Object, ws As Worksheet, wsUrls As Worksheet, urls()
Set ie = CreateObject("InternetExplorer.Application")
Set ws = ThisWorkbook.Worksheets("Scraper")
Set wsUrls = ThisWorkbook.Worksheets("Url List")
With wsUrls
urls = Application.Transpose(.Range("A2:A" & .Cells(.rows.Count, "A").End(xlUp).Row).Value)
End With
Dim results(), r As Long
ReDim results(1 To UBound(urls), 1 To 2)
With ie
.Visible = True
For r = LBound(urls) To UBound(urls)
.Navigate2 urls(r)
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
Dim email As String, website As String, iconCssSelector As String
'iconCssSelector for website icon in this instance
iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"
If ElementIsPresent(ie.document, "[href^=mailto]") Then
email = ie.document.querySelector("[href^=mailto]").innerText
Else
email = "Not found"
End If
Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent
If ElementIsPresent(ie.document, iconCssSelector) _
And ElementIsPresent(ie.document, sharedParentCssSelector) Then
Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
Else
website = "Not found"
End If
End With
results(r, 1) = email
results(r, 2) = website
Next
.Quit
End With
'Assumes headers already present
Dim nextRow As Long
nextRow = GetLastRow(ws, 1) + 1
ws.Cells(nextRow, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function
Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
'In this instance, and with the Microsoft IE DOM, you cannot select for the parent of an element with the pseudo
'class :has(>child), nor use :contains. Instead, pass the expected parent selector, that houses
'both the icon element for the website and the website address itself, and loop all matches checking for the website icon;
'if found, use childOfSiblingCssSelector to extract the website address.
Dim i As Long, html As HTMLDocument
Set html = New HTMLDocument
For i = 0 To parents.length - 1
html.body.innerHTML = parents.item(i).innerHTML
If ElementIsPresent(html, iconCssSelector) Then
GetText = html.querySelector(childOfSiblingCssSelector).innerText
Exit Function
End If
Next
GetText = "Not found"
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function
The goal is to get images from Google Images that match the part numbers in my database. My code runs, and it pulls up the correct Google pages but refuses to put the links into the spreadsheet. I have tried everything I can think of, but as of now, I keep on getting Error 1004 (Application-defined or Object-defined error).
Sub SearchBotGoogleImgLink()
Dim objIE As Object
Set IE = CreateObject("InternetExplorer.Application")
Dim HTMLdoc As HTMLDocument
Dim imgElements As IHTMLElementCollection
Dim imgElement As HTMLImg
Dim aElement As HTMLAnchorElement
Dim n As Integer
Dim i As Integer
Dim url As String
Dim url2 As String
Dim m As Long
Dim lastRow As Long
Dim url3 As String
Dim SearchRow As Long
Dim aEle As HTMLLinkElement
Worksheets("Sheet1").Select
SearchRow = 1
Do Until IsEmpty(ActiveSheet.Cells(SearchRow, 1))
Sheets("Sheet1").Select
Application.StatusBar = SearchRow - 1 & " of " & "4368" & " Items Done"
Item = Trim(ActiveSheet.Cells(SearchRow, 1))
url = "https://www.google.com/search?hl=en&biw=1600&bih=796&tbm=isch&sa=1&ei=CTOpW_2jO6nAjwT67rqACw&q=A2N0015C3KUU&oq=" & Cells(SearchRow, 1) & "&oq=A2N0015C3KUU&gs_l=img.12...0.0..1704...0.0..0.0.0.......1......gws-wiz-img.9wB6WwQJhwA"
Set objIE = New InternetExplorer
objIE.Visible = True
objIE.navigate url
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
For Each aEle In objIE.document.getElementsByTagName("IMG")
result = aEle
Sheets("Sheet1").Range(SearchRow & "C").Value = result
Sheets("Sheet1").Range(SearchRow & "D") = aEle.innerHTML
Sheets("Sheet1").Range(SearchRow & "F").Value = aEle.innerText
Debug.Print aEle.innerText
Next
Loop
'For i = 1 To lastRow
'url = "https://www.google.com/search?hl=en&biw=1600&bih=796&tbm=isch&sa=1&ei=CTOpW_2jO6nAjwT67rqACw&q=A2N0015C3KUU&oq=" & Cells(SearchRow, 1) & "&oq=A2N0015C3KUU&gs_l=img.12...0.0..1704...0.0..0.0.0.......1......gws-wiz-img.9wB6WwQJhwA"
Set HTMLdoc = objIE.document
Set imgElements = HTMLdoc.getElementsByTagName("IMG")
n = 1
For Each imgElement In imgElements
If InStr(ingElement.src, sImageSearchString) Then
If imgElement.ParentNode.nodeName = "A" Then
Set aElement = imgElement.ParentNode
If n = 2 Then
url2 = aElement.href 'imgElement.src
url3 = imgElement.src 'aElement.href
n = n + 1
End If
End If
End If
Next
Cells(SearchRow, 5) = url2
IE.Quit
Set IE = Nothing
End Sub
Notes on your code:
You need Option Explicit at the top of your code to check variable declarations and typos, amongst other advantages. There are a number of missing declarations, e.g. result, and variables used without being declared, e.g. the IE in Set IE = CreateObject("InternetExplorer.Application"). You have two different variables (one late bound and one early) both creating IE instances; you only in fact use one.
Your current error may be down to you trying to work with an object here:
result = aEle
which won't work without the Set keyword to provide the required reference.
Without example URLs and expected output it is difficult to advise on the later loops in your code. You appear to have a duplicate loop over IMG elements but this time with some restrictions. It is likely these loops can be merged.
An example:
The following uses an arbitrary concatenation to pull the img src links from the search results based on A2N0015C3KUU.
It uses a CSS selector combination of #ires img[src] to target elements with img tags and src attributes within the parent element with id ires (search results).
It is to demonstrate the principle of gathering a NodeList of matching elements and writing it out to a sheet. The querySelectorAll method applies the CSS selector combination to the HTMLDocument and returns the NodeList. The NodeList is looped along its .Length, with items accessed by index starting at 0.
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer
With IE
.Visible = True
.navigate "https://www.google.com/search?hl=en&biw=1600&bih=796&tbm=isch&sa=1&ei=CTOpW_2jO6nAjwT67rqACw&q=A2N0015C3KUU&oq=1&%20%22&oq=A2N0015C3KUU&gs_l=img.12...0.0..1704...0.0..0.0.0.......1......gws-wiz-img.9wB6WwQJhwA"
While .Busy Or .readyState < 4: DoEvents: Wend
Dim aNodeList As Object, i As Long
Set aNodeList = IE.document.querySelectorAll("#ires img[src]")
For i = 0 To aNodeList.Length - 1
ActiveSheet.Cells(i + 2, 4) = aNodeList.item(i).src
Next
'Quit '<== Remember to quit application
End With
End Sub