Excel 2016 VBA web scraping using getElementsByClassName

My system runs Windows 10 64-bit and Office 2016.
I am a bit familiar with Excel VBA, and I am trying to learn web scraping using VBA. Unfortunately, there is little in-depth information on extracting web data.
I want to scrape data from seller listings on bizbuysell.com, such as:
Sample URL 1
Sample URL 2
There is a section that starts with the headline Detailed Information
The HTML code is: Detailed Information
<h3>Detailed Information</h3>
Location:
Pinellas County, FL
Inventory:
Included in asking price
Employees:
8 FT
I want to scrape data from this section.
The problem is that there are some 18 possible data labels with their respective values, but only those for which the seller has entered data are shown.
My idea was to search for every possible data label and, if one is not present, move on to the next data field.
I tried it with the following code, but obviously I made a mistake:
For Each ele In doc.getElementsByClassName("listingProfile_details")
txt = ele.parentElement.innerText
If Left(txt, 8) = "Location" Then
location = Trim(Mid(txt, InStrRev(txt, ":") + 1))
ElseIf Left(txt, 4) = "Inventory" Then
inventory = Trim(Mid(txt, InStrRev(txt, ":") + 1))
.
.
.
End If
Next ele
I hope that someone can show me the correct VBA code to check for all 18 possible data labels and the respective data
Thank you so much!
Tony

One way is to gather a nodeList of the dt/dd elements and loop over it with a step of 2, so you can access the label at index n and the value at index n + 1.
To handle differing numbers of labels being present, you can initialise a fresh dictionary, with all the possible labels as keys, and the associated values as vbNullString, during the loop over urls, such that for each new XHR request you get a new dictionary ready to populate with the labels that are found. By using .Exists test, you only update the values for keys (labels) that are found at the current URI.
You can store all results in an array to write out to the sheet in one go at end.
There are lots of additional notes within the code.
Option Explicit
Public Sub GetDetailedBizBuySellInfo()
Dim http As Object, urls() As Variant
Dim html As MSHTML.HTMLDocument 'VBE > Tools > References > Microsoft HTML Object Library
urls = Array("https://www.bizbuysell.com/Business-Opportunity/covid-friendly-commercial-cleaning-est-30-years-100k-net/1753433/?d=L2Zsb3JpZGEvaGlsbHNib3JvdWdoLWNvdW50eS1idXNpbmVzc2VzLWZvci1zYWxlLzI/cT1hVEk5T0RFc01qQXNNekFzTnpnbWJtRndQV1UlM0Q=", _
"https://www.bizbuysell.com/Business-Opportunity/Established-Cleaning-Business-Tampa-St-Pete/1849521/?utm_source=bizbuysell&utm_medium=emailsite&utm_campaign=shtmlbot&utm_content=headline")
Set http = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
Dim url As Long, results() As Variant
ReDim results(1 To UBound(urls) + 1, 1 To 19) 'size the final output array. _
There will be the number of urls as row count, the number of labels as column count + 1 to store the url itself. You need to update the list of labels below. See GetBlankDetailedInformationDictionary
With http
For url = LBound(urls) To UBound(urls) 'loop url list
.Open "Get", urls(url), False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
Dim currentDetailedInformation As Scripting.Dictionary 'VBE > Tools > References > Microsoft Scripting Runtime
Set currentDetailedInformation = GetCurrentDetailedInfo(html) 'use retrieved html to return a dictionary with key as dt > strong e.g.Location; value as dd e.g. Tampa, FL
AddCurrentDetailedInfoToResults results, currentDetailedInformation, url, urls(url) 'url + 1 (zero indexed) will keep track of current row number to add to results
Next
End With
With ActiveSheet 'better to update with explicit sheet/be careful not to overwrite data already in a sheet
.Cells(1, 1).Resize(1, UBound(results, 2)) = currentDetailedInformation.keys ' write out headers
.Cells(1, UBound(results, 2)) = "Url"
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results ' write out results
End With
End Sub
Public Sub AddCurrentDetailedInfoToResults(ByRef results As Variant, ByVal currentDetailedInformation As Scripting.Dictionary, ByVal url As Long, ByVal currentUrl As String)
Dim key As Variant, currentColumn As Long
For Each key In currentDetailedInformation.keys
currentColumn = currentColumn + 1 'increase column count to update results array with
results(url + 1, currentColumn) = currentDetailedInformation(key)
Next
results(url + 1, currentColumn + 1) = currentUrl
End Sub
Public Function GetCurrentDetailedInfo(ByVal html As MSHTML.HTMLDocument) As Scripting.Dictionary
' Gathers a list of all the relevant dd, dt nodes within the passed in HTMLDocument.
' Requests a new blank dictionary whose keys are the labels (child strong element of dt tag)
'Updates blank dictionary, per key, where present, with dd value in a loop of step 2 as list is strong, dd, strong, dd etc.....
Dim updatedDictionary As Scripting.Dictionary, listOfLabelsAndValues As MSHTML.IHTMLDOMChildrenCollection
Set updatedDictionary = GetBlankDetailedInformationDictionary
'Css pattern to match the appropriate nodes
Set listOfLabelsAndValues = html.querySelectorAll("#ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_listingDetails_dlDetailedInformation dt > strong, #ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_listingDetails_dlDetailedInformation dd")
Dim currentIndex As Long
For currentIndex = 0 To listOfLabelsAndValues.length - 2 Step 2 'nodeList is 0 index based
'On Error Resume Next 'key (label) may not be present for current html document _
i.e. url so ignore errors when attempting to update blank dictionary via dt > strong matching on key. If label not found then value = vbNullString
Dim key As String, value As String
key = Trim$(listOfLabelsAndValues.Item(currentIndex).innerText)
value = Trim$(listOfLabelsAndValues.Item(currentIndex + 1).innerText) 'as we are looping every 2 indices 0,2,4 ....
If updatedDictionary.Exists(key) Then updatedDictionary(key) = value
'On Error GoTo 0
Next
Set GetCurrentDetailedInfo = updatedDictionary ' return updated dictionary
End Function
Public Function GetBlankDetailedInformationDictionary() As Scripting.Dictionary
Dim blankDictionary As Scripting.Dictionary, keys() As Variant, key As Long
Set blankDictionary = New Scripting.Dictionary
'' TODO Note: you would add in all 18 labels into array below.
keys = Array("Location:", "Type:", "Inventory:", "Real Estate:", "Building SF:", _
"Building Status:", "Lease Expiration:", "Employees:", "Furniture, Fixtures, & Equipment (FF&E):", _
"Facilities:", "Competition:", "Growth & Expansion:", "Financing:", "Support & Training:", _
"Reason for Selling:", "Franchise:", "Home-Based:", "Business Website:")
For key = LBound(keys) To UBound(keys)
blankDictionary(keys(key)) = vbNullString 'add blank entry to dictionary for each label
Next
Set GetBlankDetailedInformationDictionary = blankDictionary
End Function
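The blank-dictionary idea can also be tried on its own, without any web request. The following is only a minimal sketch: the names NewBlankLabels, FillFromPairs and DemoBlankLabels are hypothetical, the "Rent:" pair is made up to show an unexpected label being ignored, and late binding is used so that no reference to Microsoft Scripting Runtime is assumed.

```vba
Option Explicit

' Returns a dictionary pre-filled with every expected label and a blank value.
' Late binding: no reference to Microsoft Scripting Runtime is required.
Public Function NewBlankLabels(ByVal labels As Variant) As Object
    Dim dict As Object, i As Long
    Set dict = CreateObject("Scripting.Dictionary")
    For i = LBound(labels) To UBound(labels)
        dict(labels(i)) = vbNullString
    Next i
    Set NewBlankLabels = dict
End Function

' Walks a flat list of label, value, label, value ... in steps of 2 and
' only updates labels that already exist as keys in the dictionary.
Public Sub FillFromPairs(ByVal dict As Object, ByVal pairs As Variant)
    Dim i As Long
    For i = LBound(pairs) To UBound(pairs) - 1 Step 2
        If dict.Exists(pairs(i)) Then dict(pairs(i)) = pairs(i + 1)
    Next i
End Sub

Public Sub DemoBlankLabels()
    Dim dict As Object, key As Variant
    Set dict = NewBlankLabels(Array("Location:", "Inventory:", "Employees:"))
    ' Made-up listing: "Inventory:" is missing and "Rent:" is not an expected label.
    FillFromPairs dict, Array("Location:", "Pinellas County, FL", _
                              "Rent:", "n/a", _
                              "Employees:", "8 FT")
    For Each key In dict.Keys
        Debug.Print key, "[" & dict(key) & "]"
    Next key
End Sub
```

Because every expected label is a key from the start, each URL yields the same columns in the same order, and a label missing from a page simply stays blank.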

Related

Adding String to Content Control Box / Replace Dropdown with Content Control

I've got working code, but I'd like to replace my drop-down with a content control, because I also need to be able to type in a value manually.
The list values come from an HTTPS request. That part works completely fine, so please ignore it.
Here's my code:
Dim MyRequest As Object
Dim Data() As String
Dim i As Integer
Dim j As Integer
Dim maxi As Integer
Set MyRequest = CreateObject("WinHttp.WinHttpRequest.5.1")
MyRequest.Open "GET", _
"https... (This is hidden for security resons, /csv/)"
' Send Request.
MyRequest.Send
'And we get this response
'MsgBox MyRequest.ResponseText
Data = Split(MyRequest.ResponseText, "|")
If UBound(Data()) > 25 Then
maxi = 25
Else
maxi = UBound(Data())
End If
For j = 1 To 6
ActiveDocument.FormFields("Dropdown" & j).DropDown.ListEntries.Clear
For i = 0 To maxi - 1
ActiveDocument.FormFields("Dropdown" & j).DropDown.ListEntries.Add Name:=Data(i)
Next i
Next j
End Sub
You should not use content controls and formfields in the same document. They were not designed to be used that way and doing so is a known source of problems.
As you observed, dropdown formfields don’t support text entry. To provide that facility, you could provide an option in the dropdown for 'free text' and use an on-exit macro with an Inputbox to insert the user’s 'free text' into the dropdown. For example, suppose you have a dropdown with 5 items, the last of which offers free text entry (e.g. an 'Other' option). Adding the following on-exit macro to the formfield will provide that:
Sub FreeText()
Dim StrNew As String, i As Long
With Selection.FormFields(1).DropDown
i = .ListEntries.Count
If .Value = i Then
StrNew = Trim(InputBox("Input your text", "Data Entry", .ListEntries(i).Name))
If StrNew = vbNullString Then Exit Sub
.ListEntries(i).Delete
.ListEntries.Add StrNew
.Value = i
End If
End With
End Sub
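If the dropdowns are filled by code, the on-exit macro can probably be attached by code as well. This is only a sketch under assumptions: the formfields are named Dropdown1 to Dropdown6 as in the question, the macro is the FreeText sub above, the "Other" entry text and the sub name AttachFreeTextMacro are made up, and it is meant to run once, after the lists have been populated. A dropdown formfield holds at most 25 entries, so the "Other" entry is only added if there is room.

```vba
Public Sub AttachFreeTextMacro()
    Dim j As Long
    For j = 1 To 6
        With ActiveDocument.FormFields("Dropdown" & j)
            ' Keep one slot free for the free-text option (limit is 25 entries).
            If .DropDown.ListEntries.Count < 25 Then
                .DropDown.ListEntries.Add Name:="Other"
            End If
            ' Run the FreeText macro whenever the user leaves this formfield.
            .ExitMacro = "FreeText"
        End With
    Next j
End Sub
```

If the list is already full, capping the population loop in the question at 24 entries instead of 25 would leave room for the free-text option.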

I have questions about how to click and search on web using vba

I have a question about how to click a button and run a search on a web page using VBA.
I have written the code below, but I cannot work out how to click the button on this page.
Sub LEISearch()
'dimension (declare or set aside memory for) our variables
Dim objIE As InternetExplorer 'special object variable representing the IE browser
Dim LEI As HTMLLinkElement 'special object variable for an <a> (link) element
Dim y As Integer 'integer variable we'll use as a counter
Dim result As String 'string variable that will hold our result link'
Dim result2 As String
'initiating a new instance of Internet Explorer and asigning it to objIE
Set objIE = New InternetExplorer
'make IE browser visible (False would allow IE to run in the background)
objIE.Visible = True
'navigate IE to this web page (a pretty neat search engine really)
objIE.navigate "https://www.gmeiutility.org/search.jsp?keyWord"
'wait here a few seconds while the browser is busy
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
'put the value of cell "A1" (sheet "Macro1") into the search box
objIE.document.getElementById("searchInput").Value = _
Sheets("Macro1").Range("A1").Value
'click the 'go' button
Set LEIButton = objIE.document.getElementsByClassName("hiddenSubmitButton")
LEIButton.Focus
LEIButton.Click
'wait again for the browser
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
End Sub
This case is a really nice and clean example of web scraping, so I will take this opportunity to present an educational post.
I highly recommend avoiding using IE to scrape websites whenever it's possible. It's highly inefficient. Especially in a case like this where there can be multiple pages of results. Instead, you can use HTTP requests.
An HTTP request is a structured way to request something from a server. In this case we want to send a keyword to the server and get the corresponding search results.
To find out what this request should look like, you have to inspect the network traffic when the button with the magnifying glass is clicked. You can do that through your browser's developer tools (Ctrl+Shift+E if you're using Firefox):
If you go through the Headers and the Params of the request you will see what the url, the body and the headers should look like. In this particular case, all the parameters are encoded into the url and the headers are not essential to the success of the request, so all you need is the url.
Some of the parameters of the url are the keyword, the number of results per page and the page number.
The response's payload is in json format. You can inspect its structure using a tool like this. Here's what it looks like:
Basically the JSON response consists of as many results as you have specified that should be displayed per page (or less). To get the next page you need to send a new request with the same keyword but specifying a new page number and so on.
In fact, as you can see the website offers a lot more data than what's displayed on your browser, which could prove to be useful.
The code below searches for the keyword test, while requesting 25 results per page. A first request is sent to find out how many pages of results there are, and then the code loops through all pages and prints the results in a worksheet.
TL;DR
Option Explicit
Sub main()
Dim sht As Worksheet
Dim totalNumberOfPages As Long
Dim searchResults As Object
Dim pageNumber As Long
Dim results() As String
Dim entity As Object
Dim i As Long, j As Long
Dim rng As Range
Set sht = ThisWorkbook.Worksheets("Name of your Worksheet")
''''''First request to find out the number of pages''''''
Set searchResults = XHRrequest("test", 25, 1) '
totalNumberOfPages = searchResults("totalPages") '
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'''''''''''''''''''Loop through all the pages''''''''''''''''''''''''''''''''
For pageNumber = 1 To totalNumberOfPages Step 1 '
Set searchResults = XHRrequest("test", 25, pageNumber) '
ReDim results(1 To searchResults("entitySearchResult").Count, 1 To 7) '
i = 0 '
'''''''''''write the results in an array'''''''''''''''''''''''''''' '
For Each entity In searchResults("entitySearchResult") ' '
i = i + 1 ' '
results(i, 1) = entity("LEINumber") ' '
results(i, 2) = entity("legalName") ' '
results(i, 3) = entity("city") ' '
results(i, 4) = entity("headquartersCountry") ' '
results(i, 5) = entity("recordStatus") ' '
results(i, 6) = entity("renewalStatus") ' '
results(i, 7) = entity("entityStatus") ' '
Next entity ' '
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' '
'''''''''''''''write all the results in the worksheet in one go''''' '
With sht ' '
Set rng = .Range("A" & .Rows.Count).End(xlUp).Offset(1, 0) ' '
End With ' '
rng.Resize(UBound(results, 1), UBound(results, 2)) = results ' '
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' '
Next pageNumber '
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
End Sub
Public Function XHRrequest(ByVal key As String, ByVal resultsPerPage As Long, ByVal pageNumber As Long) As Object
Dim req As New WinHttpRequest
Dim url As String
url = "https://www.gmeiutility.org/actions/Search/?isPendingValidationChecked=true&isSearchAllLOUChecked=true&keyWord=" & key & "&page=" & pageNumber & "&resultsPerPage=" & resultsPerPage & "&searchType=baseSearch" 'build the URL according to the parameters
'''''''''Send the HTTP request'''''''''''''''''''''''''''''''
With req '
.Open "POST", url, False '
.send '
Set XHRrequest = JsonConverter.ParseJson(.responseText) '
End With '
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
End Function
For demonstration purposes the code above prints all the data in worksheet named Name of your Worksheet.
If you need to perform multiple searches you can easily modify the code to best fit your needs. More specifically you can loop through multiple keywords and call the XHRrequest function using those keywords instead of "test".
Here's a sample of the output:
You will need to add the following references to your project (VBE>Tools>References):
Microsoft WinHTTP Services version 5.1
Microsoft HTML Object Library
Microsoft Scripting Runtime
You will also need to add this JSON parser to your project. Follow the installation instructions in the link and you should be set to go.
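When looping through several keywords, anything containing spaces or characters such as & should probably be URL-encoded before it goes into the query string. The sketch below is an assumption-laden illustration rather than part of the original answer: BuildSearchUrl and DemoBuildSearchUrls are hypothetical names, the keyword list is made up, and it relies on WorksheetFunction.EncodeURL, which is available in Excel 2013 and later on Windows.

```vba
Option Explicit

' Builds the search URL used above, URL-encoding the keyword first.
Public Function BuildSearchUrl(ByVal keyWord As String, ByVal resultsPerPage As Long, _
                               ByVal pageNumber As Long) As String
    BuildSearchUrl = "https://www.gmeiutility.org/actions/Search/" & _
        "?isPendingValidationChecked=true&isSearchAllLOUChecked=true" & _
        "&keyWord=" & Application.WorksheetFunction.EncodeURL(keyWord) & _
        "&page=" & pageNumber & _
        "&resultsPerPage=" & resultsPerPage & _
        "&searchType=baseSearch"
End Function

Public Sub DemoBuildSearchUrls()
    Dim keywords As Variant, i As Long
    keywords = Array("test", "bank of", "smith & sons") ' made-up keywords
    For i = LBound(keywords) To UBound(keywords)
        ' Each URL could be passed to the request code in place of the inline string.
        Debug.Print BuildSearchUrl(CStr(keywords(i)), 25, 1)
    Next i
End Sub
```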

Pulling specific table cells from Morningstar, then looping to next Morningstar page

I am currently trying to scrape certain pieces of data from a table on Morningstar, then have it loop to the next ticker and repeat until there are no more tickers.
Currently, it will pull the entire "rank in category" row on the Trailing Total Returns table. I'm simply trying to pull the 3 month, 6 month, YTD, 1 year, 3 year, and 5 year. When it's done pulling those, it will loop to the next ticker as determined by the "Cells(p, 14)" in the navigate line.
ie. It detects "LINKX" is in cell 1, 14 so it navigates to http://performance.morningstar.com/fund/performance-return.action?t=LINKX&region=usa&culture=en_US and pulls all of the "Rank in Category" lines from "trailing total returns" table. I only want the specified ones put into specified cell locations, then loop to the next ticker.
I've browsed through many of these threads. Using Excel VBA, I am trying to pull specific key info from a ticker's page, then loop to the next ticker and repeat.
Declare PtrSafe Function apiShowWindow Lib "user32" Alias "ShowWindow" _
(ByVal hwnd As LongPtr, ByVal nCmdShow As LongPtr) As LongPtr
Global Const SW_MAXIMIZE = 3
Global Const SW_SHOWNORMAL = 1
Global Const SW_SHOWMINIMIZED = 2
Sub LinkedInWebScrapeScript()
Dim objIE As InternetExplorer
Dim html As HTMLDocument
Set objIE = GetObject("new:{D5E8041D-920F-45e9-B8FB-B1DEB82C6E5E}")
objIE.Visible = 1
Dim p As Integer
p = 3
objIE.navigate ("http://performance.morningstar.com/fund/performance-return.action?t=" & Cells(p, 14) & "&region=usa&culture=en_US")
Application.Wait Now + #12:00:02 AM#
While objIE.Busy
DoEvents
Wend
apiShowWindow objIE.hwnd, SW_MAXIMIZE
For i = 1 To 2
objIE.document.parentWindow.scrollBy 0, 100000 & i
Application.Wait Now + #12:00:01 AM#
Next i
Dim TDelements As IHTMLElementCollection
Dim htmldoc As MSHTML.IHTMLDocument 'Document object
Dim eleColtr As MSHTML.IHTMLElementCollection 'Element collection for tr tags
Dim eleColtd As MSHTML.IHTMLElementCollection 'Element collection for td tags
Dim eleColtd1 As MSHTML.IHTMLElementCollection
Dim eleRow As MSHTML.IHTMLElement 'Row elements
Dim eleCol As MSHTML.IHTMLElement 'Column elements
Set htmldoc = objIE.document 'Document webpage
Set eleColtr = htmldoc.getElementsByTagName("tr") 'Find all tr tags
Set TDelements = htmldoc.getElementsByTagName("table")
'This section populates Excel
i = 0 'start with first value in tr collection
Set eleColtd = htmldoc.getElementsByClassName("r_table3 width955px print97")(0).getElementsByClassName("last")(0).getElementsByClassName("row_data divide") 'get all the td elements in that specific tr
For Each eleCol In eleColtd 'for each element in the td collection
Sheets("Sheet2").Range("A1").Offset(i, j).Value = eleCol.innerText 'paste the inner text of the td element, and offset at the same time
j = j + 1 'move to next element in td collection
Next eleCol 'rinse and repeat
i = i + 1
p = p + 1
objIE.navigate ("http://performance.morningstar.com/fund/performance-return.action?t=" & Cells(p, 14) & "&region=usa&culture=en_US")
Set eleColtd = htmldoc.getElementsByClassName("r_table3 width955px print97")(0).getElementsByClassName("last")(0).getElementsByClassName("row_data divide") 'get all the td elements in that specific tr
For Each eleCol In eleColtd 'for each element in the td collection
Sheets("Sheet2").Range("A1").Offset(i, j).Value = eleCol.innerText 'paste the inner text of the td element, and offset at the same time
z = z + 1
j = j + 1 'move to next element in td collection
Next eleCol 'rinse and repeat
End Sub
The following shows a loop and how to select the appropriate table, tbody and then table cells using css selectors. Tickers are read into an array from column N starting at row 1. It assumes there are no blank cells within the range (though you could add a test to be sure).
There is a loop over the array, which contains each ticker, and the TICKER placeholder in the url is replaced with the current ticker value.
There is a line to click on the monthly display tab.
The appropriate row is identified via
Set rankings = .querySelectorAll("#tab-month-end-content .last td")
#tab-month-end-content is an id selector which gets the right tab, then .last is the class selector for the class name of the last tbody (which is last), then td is used to specify the child td cells within that tbody.
CSS selectors:
Modern browsers are optimized for css. Css selectors are a fast way to match on elements in an html document. Css selectors are applied via querySelector or querySelectorAll methods; in this case, of HTMLDocument (ie.document). querySelector returns a single node: the first match for the css selector; querySelectorAll returns a nodeList of all matched items - you then index into that nodeList to get specific items e.g. the second td cell is at index 1.
Looking at the pattern we specified:
#tab-month-end-content .last td
The first part is an id selector, #, which selects an element by id
#tab-month-end-content
When applied to the page this returns two matches and we want the second
The next part
.last
is a class selector, ., for class name last. This selects the tbody tag child element shown in the image above. As only the second id matched element has this child we are now working with the right parent element to go on and select the td type elements using type selector
td
The whitespace between each part described above is a descendant combinator. It specifies that elements matched by the selector on the right are selected only if they have an ancestor element matching the selector on the left; the ancestor need not be the direct parent.
We can see this with the next image:
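The selector can also be tried against a small, made-up HTML fragment, which may help when experimenting without loading the real page. This is a sketch only: the markup below is invented to mimic the structure described above (it is not Morningstar's actual HTML), the function name DemoDescendantSelector is hypothetical, and it assumes a reference to Microsoft HTML Object Library.

```vba
Option Explicit

' Requires VBE > Tools > References > Microsoft HTML Object Library
Public Function DemoDescendantSelector() As String
    Dim html As MSHTML.HTMLDocument, cells As Object, i As Long, joined As String
    Set html = New MSHTML.HTMLDocument

    ' Invented fragment: one element with the id, containing two tbody
    ' elements, of which only the second has the class "last".
    html.body.innerHTML = "<div id=""tab-month-end-content""><table>" & _
        "<tbody><tr><td>skip</td></tr></tbody>" & _
        "<tbody class=""last""><tr><td>10</td><td>20</td></tr></tbody>" & _
        "</table></div>"

    ' id selector, then class selector, then type selector, each separated
    ' by a descendant combinator (the space).
    Set cells = html.querySelectorAll("#tab-month-end-content .last td")

    For i = 0 To cells.Length - 1 ' nodeList is 0-based
        joined = joined & IIf(i > 0, "|", vbNullString) & cells.Item(i).innerText
    Next i
    DemoDescendantSelector = joined
End Function
```

Only the td cells inside the tbody with class last should be matched; the cell in the first tbody is outside the .last ancestor and is skipped.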
VBA:
Option Explicit
Public Sub GetData()
Dim ie As Object, tickers(), ws As Worksheet, lastRow As Long
Dim results(), headers(), r As Long, i As Long, url As String
headers = Array("ticker", "3m", "6m", "ytd", "1y", "3y", "5y")
Set ws = ThisWorkbook.Worksheets("Sheet1")
tickers = Application.Transpose(ws.Range("N1:N" & GetLastRow(ws, 14)).Value)
ReDim results(1 To UBound(tickers), 1 To UBound(headers) + 1)
Set ie = GetObject("new:{D5E8041D-920F-45e9-B8FB-B1DEB82C6E5E}")
With ie
.Visible = True
For i = LBound(tickers) To UBound(tickers)
r = r + 1
url = Replace$("http://performance.morningstar.com/fund/performance-return.action?t=TICKER&region=usa&culture=en_US", "TICKER", tickers(i))
.Navigate2 url
While .Busy Or .readyState < 4: DoEvents: Wend
.document.querySelector("[tabname='#tabmonth']").Click
Dim rankings As Object
Do
Loop While .document.querySelectorAll("#tab-month-end-content .last td").Length = 0 'could add timed loop here
With .document
Set rankings = .querySelectorAll("#tab-month-end-content .last td")
On Error Resume Next
results(r, 1) = tickers(i)
results(r, 2) = rankings.item(1).innerText
results(r, 3) = rankings.item(2).innerText
results(r, 4) = rankings.item(3).innerText
results(r, 5) = rankings.item(4).innerText
results(r, 6) = rankings.item(5).innerText
results(r, 7) = rankings.item(6).innerText
On Error GoTo 0
End With
Set rankings = Nothing
Next
ws.Cells(1, 15).Resize(UBound(results, 1), UBound(results, 2)) = results
.Quit
End With
End Sub
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function
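The Do ... Loop that waits for the ranking cells has no exit if the cells never appear, and the code comment notes that a timed loop could be added. One possible shape for that is sketched below. The function name WaitForElements and the 10 second default are assumptions, and doc is expected to be the browser's document object as used in the code above.

```vba
Option Explicit

' Polls the document until at least one element matches cssSelector,
' or until timeoutSeconds have elapsed. Returns True if a match was found.
Public Function WaitForElements(ByVal doc As Object, ByVal cssSelector As String, _
                                Optional ByVal timeoutSeconds As Double = 10) As Boolean
    Dim startTime As Double
    startTime = Timer
    Do
        DoEvents ' let the browser and Excel process pending events
        If doc.querySelectorAll(cssSelector).Length > 0 Then
            WaitForElements = True
            Exit Function
        End If
        ' Timer counts seconds since midnight, so compensate if it wraps around.
        If Timer < startTime Then startTime = startTime - 86400#
    Loop While Timer - startTime < timeoutSeconds
End Function
```

Inside the With ie block it might be called as If Not WaitForElements(.document, "#tab-month-end-content .last td") Then followed by whatever handling suits a ticker that fails to load, for example skipping to the next one.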
As mentioned by @SIM, you could use xmlhttp and avoid the browser, though depending on your security settings you may need to whitelist the sites. You will need to check whether the placeholder is valid in the url here: XNAS:TICKER. The XNAS prefix may vary across your tickers, in which case you would need the appropriate string including prefix in column N and then replace the extended placeholder with that e.g. .....=PLACEHOLDER&region.......
Option Explicit
Public Sub GetData()
Dim tickers(), ws As Worksheet, lastRow As Long
Dim results(), headers(), r As Long, i As Long, url As String, html As HTMLDocument
Set html = New HTMLDocument 'vbe > tools > references > Microsoft HTML Object Library
headers = Array("ticker", "3m", "6m", "ytd", "1y", "3y", "5y")
Set ws = ThisWorkbook.Worksheets("Sheet1")
tickers = Application.Transpose(ws.Range("N1:N" & GetLastRow(ws, 14)).Value)
ReDim results(1 To UBound(tickers), 1 To UBound(headers) + 1)
With CreateObject("MSXML2.XMLHTTP")
For i = LBound(tickers) To UBound(tickers)
r = r + 1
url = Replace$("http://performance.morningstar.com/perform/Performance/fund/trailing-total-returns.action?&t=XNAS:TICKER&region=usa&culture=en-US&cur=&ops=clear&s=0P0000J533&ndec=2&ep=true&align=m&annlz=true&comparisonRemove=false&loccat=&taxadj=&benchmarkSecId=&benchmarktype=", "TICKER", tickers(i))
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.setRequestHeader "DNT", "1"
.send
html.body.innerHTML = .responseText
Dim rankings As Object
With html
Set rankings = .querySelectorAll(".last td")
On Error Resume Next
results(r, 1) = tickers(i)
results(r, 2) = rankings.item(1).innerText
results(r, 3) = rankings.item(2).innerText
results(r, 4) = rankings.item(3).innerText
results(r, 5) = rankings.item(4).innerText
results(r, 6) = rankings.item(5).innerText
results(r, 7) = rankings.item(6).innerText
On Error GoTo 0
End With
Set rankings = Nothing
Next
ws.Cells(1, 15).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function

Is there a way to slow down a Web Scraper so it will pick up the code?

I wrote a macro to go to WU to get historical data and for the most part, it works. However, I believe that the macro is running too fast for it to pick up the data from the website.
The website is https://www.wunderground.com/history/daily/us/tx/el-paso/KELP/date/2017-1-3 and the table I want has the class tablesaw-sortable.
I have tried the following: DoEvents and Application.Wait (Now + TimeValue("00:00:01")) to try to slow down the process.
Sub BrowseToWU()
Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As MSHTML.HTMLDocument
Dim RowAddress As Integer
Dim WebAddress As String
Dim DateSheet As Date
Dim WkDay As Integer
Dim DateSheetName As String
'Application.ScreenUpdating = False
'Application.StatusBar = True
RowAddress = 2
IE.Visible = True
Do Until RowAddress = 60
WebAddress = Range("A" & RowAddress)
DateSheet = Right(WebAddress, 8)
DateSheetName = Right(WebAddress, 8)
WkDay = Weekday(DateSheet, vbSunday)
If WkDay < 3 Then
RowAddress = RowAddress + 1
ElseIf WkDay > 6 Then
RowAddress = RowAddress + 1
Else
IE.Navigate WebAddress
Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop
Set HTMLDoc = IE.Document
DoEvents
Application.Wait (Now + TimeValue("00:00:05"))
DoEvents
ProcessHTMLPage HTMLDoc
DateSheet = Right(WebAddress, 8)
DoEvents
Application.Wait (Now + TimeValue("00:00:01"))
ActiveSheet.Name = DateSheetName
DoEvents
RowAddress = RowAddress + 1
'IE.Quit
Worksheets("Sheet1").Activate
End If
Loop
End Sub
Option Explicit
Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)
Dim HTMLTable As MSHTML.IHTMLElement
Dim HTMLTables As MSHTML.IHTMLElementCollection
Dim HTMLRow As MSHTML.IHTMLElement
Dim HTMLCell As MSHTML.IHTMLElement
Dim RowNum As Long, ColNum As Integer
'Dim IE As New SHDocVw.InternetExplorer
'Dim Ws As Worksheet
Set HTMLTables = HTMLPage.getElementsByClassName("tablesaw-sortable")
'DoEvents
For Each HTMLTable In HTMLTables
Worksheets.Add
DoEvents
Range("A1").Value = HTMLTable.className
Range("B1").Value = Now
RowNum = 2
For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
'Debug.Print vbTab & HTMLRow.innerText
ColNum = 1
For Each HTMLCell In HTMLRow.Children
Cells(RowNum, ColNum) = HTMLCell.innerText
ColNum = ColNum + 1
Next HTMLCell
RowNum = RowNum + 1
Next HTMLRow
Next HTMLTable
DoEvents
'IE.Quit
End Sub
The macro is supposed to run through sheet1 picking up the web address to the historical data if it satisfies the criteria of being a certain day of the week.
IE will open and then it will kick over to the next module that will take in the data.
A new worksheet is created and the data pasted into the new worksheet.
The worksheet is renamed to the date of the data.
The web address sheet is activated again and the process starts over again.
The error I get is that the data isn't taken from the website, so the For statement ends and the web address sheet is renamed and an error occurs.
One way around this is to call the API that the page is using to get that info.
The API returns json which you can parse with a json parser. I use jsonconverter.bas. After installing the code from that link in a standard module called JsonConverter, go to VBE > Tools > References > Add a reference to Microsoft Scripting Runtime.
Finding the API:
If you press F12 to open developer tools and go to the Network tab and then press F5 to refresh any url of interest you will see the recorded web traffic. You can find the API call there.
See my answer here on how to search the network traffic using a specific observation value you expect to see in the response - this will filter the list of network traffic to those items containing the value of interest. Be sensible in selecting the value - you want something unlikely to occur elsewhere. You can also filter the network traffic to XHR only.
The API response:
The API returns json. More specifically, it returns a dictionary containing 2 keys. The second key, "observations", can be used to return a collection (denoted by []) of dictionaries (denoted by {}).
Each dictionary represents a row of the table (daily observations). You can loop this collection, and then loop the inner dictionaries, to access the table row values and reconstruct the table by populating an array. Explore example json response here.
Explanation of json structure:
click here to enlarge
Explanation of code:
The code is broken down into a number of helper subs and functions, allocating certain tasks to each, to make the code easier to debug and follow, as well as to align better with Object Oriented Programming principles.
Overall the process is:
1. Gather urls from Worksheets("Sheet1"). Helper function GetAllUrls.
2. Process those urls and only retain the dates which correspond with Tue-Thur. These are kept as strings formatted as "yyyymmdd" so they can be passed to the API later. This is handled by helper functions GetOnlyQualifyingUrlsDates and IncludeThisDate. IncludeThisDate performs the check for whether to include; GetOnlyQualifyingUrlsDates handles the looping and formatting of results.
3. Issue xmlhttp requests by looping over qualifying url dates and concatenating those into the url for the API call, then issuing the request. This is performed by the main sub GetTables.
4. Sheet creation, for output, is handled by helper function CreateWorksheet. This function calls another helper function, SheetExists, to ensure sheets are only created if they don't already exist; otherwise, the existing sheet by that name is used.
5. The resultant json response, from step 3, is passed to a helper sub WriteOutResults which accepts the json variable and the output sheet object as arguments. It extracts all the info from the json response, essentially reconstructing the table. It adds the table and headers to the appropriate sheet. It calls helper function Epoch2Date, which handles the unix timestamp to datetime conversion for the two unix fields in the json object.
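Two of the small steps above, the date filter and the timestamp conversion, can be sketched in isolation. These are not the answer's own helpers: UrlDateToApiDate and EpochSecondsToDate are hypothetical names, and the sketch assumes that each url ends in a year-month-day segment as in the question's example and that Tuesday to Thursday are the qualifying days.

```vba
Option Explicit

' Takes a url ending in ".../date/2017-1-3" and returns "20170103" if the
' date falls on Tue-Thu, otherwise an empty string.
' Assumes the last path segment is year-month-day with numeric parts.
Public Function UrlDateToApiDate(ByVal url As String) As String
    Dim parts() As String, d As Date
    parts = Split(Mid$(url, InStrRev(url, "/") + 1), "-")
    If UBound(parts) <> 2 Then Exit Function ' not a year-month-day segment
    d = DateSerial(CInt(parts(0)), CInt(parts(1)), CInt(parts(2)))
    Select Case Weekday(d, vbSunday)
        Case vbTuesday To vbThursday
            UrlDateToApiDate = Format$(d, "yyyymmdd")
    End Select
End Function

' Converts a unix timestamp (seconds since 1 Jan 1970, GMT) to a VBA Date.
' The result is still in GMT; no local time zone offset is applied.
Public Function EpochSecondsToDate(ByVal epochSeconds As Double) As Date
    EpochSecondsToDate = DateSerial(1970, 1, 1) + epochSeconds / 86400#
End Function
```

Building the date with DateSerial from the split parts avoids locale-dependent parsing, and it copes with both 2017-1-3 and 2017-10-13, whose lengths differ.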
TODO:
The API key may be time limited. Add a helper function which returns the current valid key.
The API accepts start date and end date parameters in the url construct. It would be far better to issue one request for the entire range if possible, or chunked ranges e.g. months, to reduce the number of requests made. This would also reduce the likelihood of being blocked. This would mean some additional code would need to be written, before writing out results, to ensure only dates of interest are being written to sheets. Though you could write out all then simply loop all sheets and delete those that aren't wanted (perfectly doable if we are talking about 365 dates total). Personally, I would handle the include date part in the construction of the table from a single request (if possible) that has the min and max dates for entire urls listed passed as start and end date parameters. I would then write a single flat table out to one sheet as this will be much easier for later data analysis.
VBA:
Option Explicit
Public Sub GetTables()
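'VBE > Tools > References > Microsoft Scripting Runtime
'Also requires the JsonConverter module (VBA-JSON) to be imported into the project for JsonConverter.ParseJson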
Dim json As Object, qualifyingUrlsDates(), urls(), url As String
Dim ws As Worksheet, wsOutput As Worksheet, i As Long, startDate As String, endDate As String
Set ws = ThisWorkbook.Worksheets("Sheet1")
urls = GetAllUrls(2, ws, "A")
qualifyingUrlsDates = GetOnlyQualifyingUrlsDates(urls)
'API key may not remain valid over time, so look at obtaining it via a prior request
'Issue an xmlhttp request for each valid date. This would be better done using start and end dates
'to specify the entire range, or batches (e.g. months) within the total range, to cut down on requests
With CreateObject("MSXML2.XMLHTTP")
For i = LBound(qualifyingUrlsDates) To UBound(qualifyingUrlsDates)
startDate = qualifyingUrlsDates(i)
endDate = startDate ' a little verbose but useful for explaining
url = "https://api.weather.com/v1/geocode/31.76/-106.49/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=" & startDate & "&endDate=" & endDate & "&units=e"
.Open "GET", url, False
.send
Set json = JsonConverter.ParseJson(.responseText)("observations")
Set wsOutput = CreateWorksheet(qualifyingUrlsDates(i))
WriteOutResults wsOutput, json
Next
End With
End Sub
Public Sub WriteOutResults(ByVal wsOutput As Worksheet, ByVal json As Object)
'json is a collection of dictionaries. Each dictionary is a time period reading from the day i.e. one row in output
Dim results(), item As Object, headers(), r As Long, c As Long, key As Variant
headers = json.item(1).keys 'get the headers which are the keys of each dictionary
ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
For Each item In json
r = r + 1: c = 0 'increase row in results array to store results for table row
For Each key In item.keys
c = c + 1 'increase column number in results array for writing out results
Select Case key
Case "valid_time_gmt", "expire_time_gmt" 'convert unix timestamp fields to datetime
results(r, c) = Epoch2Date(item(key))
Case Else
results(r, c) = item(key)
End Select
Next
Next
With wsOutput
.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Public Function GetOnlyQualifyingUrlsDates(ByVal urls As Variant) As Variant
Dim i As Long, output(), counter As Long
ReDim output(1 To UBound(urls))
For i = LBound(urls) To UBound(urls)
If IncludeThisDate(urls(i)) Then 'check if weekday is to be included
counter = counter + 1
output(counter) = Format$(Right$(urls(i), 8), "yyyymmdd") 'if to include then add to output array of urls of interest
End If
Next
ReDim Preserve output(1 To counter)
GetOnlyQualifyingUrlsDates = output
End Function
Public Function IncludeThisDate(ByVal url As String) As Boolean
'tue, wed, thurs are valid
IncludeThisDate = Not IsError(Application.Match(Weekday(Right$(url, 8), vbSunday), Array(3, 4, 5)))
End Function
Public Function SheetExists(ByVal sheetName As String) As Boolean '<== function by #Rory
SheetExists = Evaluate("ISREF('" & sheetName & "'!A1)")
End Function
Public Function GetAllUrls(ByVal startRow As Long, ByVal ws As Worksheet, ByVal columnName As String) As Variant
'transpose used based on premise no more than a couple of years of dates
'startRow is start row for urls, ws is sheet where urls found, columnName is string representation of column for urls e.g. "A"
With ws
GetAllUrls = Application.Transpose(.Range(columnName & startRow & ":" & columnName & .Cells(.Rows.Count, columnName).End(xlUp).Row).Value) 'use columnName rather than a hard-coded "A"
End With
End Function
Public Function CreateWorksheet(ByVal sheetName As String) As Worksheet
Dim ws As Worksheet
If SheetExists(sheetName) Then
Set ws = ThisWorkbook.Worksheets(sheetName)
'do something.... clear it? Then add new data to it?
Else
Set ws = ThisWorkbook.Worksheets.Add
ws.Name = sheetName
End If
Set CreateWorksheet = ws
End Function
Public Function Epoch2Date(ByVal E As Currency, Optional msFrac) As Date '# Schmidt http://www.vbforums.com/showthread.php?805245-EPOCH-to-Date-and-vice-versa
Const Estart As Double = #1/1/1970#
msFrac = 0
If E > 10000000000# Then E = E * 0.001: msFrac = E - Int(E)
Epoch2Date = Estart + (E - msFrac) / 86400
End Function

VBA Code Scraper not placing data in right columns

The code works fine, but I need it to extract ONLY emails and URLs and place them in Sheet1 "Scraper" on the NEXT BLANK ROW
Emails = Column A
Urls = Column B
Currently it extracts any text, email or URL, and places it in column A or B.
I only need emails or URLs. I have been stuck on this for some time and can't seem to work it out.
Also, I am not sure whether my DELETE DUPLICATES step is deleting duplicate rows or duplicates within a column. It SHOULD be duplicate rows.
How the code works:
On Sheet2 "URL List" I have a list of URLs. The code runs through this list, places the results onto Sheet1 "Scraper", and deletes any duplicates.
It is only supposed to scrape emails and URLs and place them in columns A and B on the NEXT BLANK ROW.
I have tried to fix the problem, but it is beyond my abilities.
Private Sub fbStart_Click()
'Set sheet2 URL List and open Internet Explorer
Dim lr As Long
Dim x As Long
Dim arr() As Variant
Dim wks As Worksheet
Dim ie As Object
Dim dd(1 To 2) As String
Dim Fr As Long
On Error Resume Next
Application.ScreenUpdating = False
Set wks = ThisWorkbook.Sheets("Url List")
With wks
Fr = .Cells(.Rows.Count, 6).End(xlUp).Offset(1).Row
lr = .Cells(.Rows.Count, 1).End(xlUp).Row
.Cells(1, 5).Value = lr
arr = .Range(.Cells(Fr, 1), .Cells(lr, 1)).Value
End With
'Show Internet Explorer and add delay in seconds if needed
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
Application.Wait Now + TimeValue("0:00:0")
For x = LBound(arr, 1) To UBound(arr, 1)
.navigate arr(x, 1)
wtime = Time
Do While .Busy Or .readyState <> 4
DoEvents
'Skip pages with Captchas + write the word Captcha in Sheet 2 Column C
If Time > (wtime + TimeValue("00:00:10")) Then
Cells(x + 1, "C").Value = "Captcha"
Exit Do
End If
Loop
On Error Resume Next
'Variable for document or data which need to be extracted out of webpage, change innertext number if same class used
Dim doc As HTMLDocument
Set doc = ie.document
dd(1) = doc.getElementsByClassName("_50f4")(2).innerText
dd(2) = doc.getElementsByClassName("_50f4")(3).innerText
'Paste the web data into Sheet1 "Scraper" in next BLANK ROW
With Sheet1
.Cells(.Rows.Count, "A").End(xlUp).Offset(1, 0).Resize(, 2).Value = dd
End With
' Put A number 1 in Sheet2 "Url List"column B to notify this URL is done
Sheets("Url List").Range("B" & Rows.Count).End(xlUp).Offset(1, 0).Value = 1
'Deletes duplicates in column A Sheet1
Columns(1).RemoveDuplicates Columns:=Array(1)
Columns(2).RemoveDuplicates Columns:=Array(1)
'Count No1 in sheet2 Column B
With Worksheets("Url List")
Lastrow = .Cells(.Rows.Count, "B").End(xlUp).Row
Sheets("Url List").Range("B1").Value = Lastrow
End With
Call Autoclick_Click
Next x
.Quit
End With
'Hide FaceBook Scraper Form
ScraperForm.Hide
End Sub
The code below shows how to find the email and website address. You already have your loop and de-duplication; what follows are helper methods for extracting the required info. You can simply assign the variables email and website to your cells in the loop. I also show a helper function that determines the last row in the target sheet, so that both variables are written to the correct columns in one go.
I can help with integrating this into your loop if needed, but the emphasis here is on explaining how to identify the elements of interest and how to write out to the correct columns. To be honest, de-duplicating is easily done in the sheet at the end; you can also use the macro recorder to get perfectly functional code for that single step, or use existing SO answers.
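On the duplicate-rows question: calling RemoveDuplicates on Columns(1) and Columns(2) separately de-duplicates each column on its own and shifts the remaining cells up, so emails and websites can end up on different rows. A minimal sketch of a row-wise alternative follows, assuming the data sits in columns A:B with a header in row 1. The sub name RemoveDuplicateRows is hypothetical.

```vba
Option Explicit

'Hypothetical helper: removes rows where the combination of columns A and B repeats
Public Sub RemoveDuplicateRows(ByVal ws As Worksheet)
    Dim lastRow As Long
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    If lastRow < 2 Then Exit Sub 'nothing below the header row
    'Both columns are compared together, so only fully identical rows are removed
    ws.Range("A1:B" & lastRow).RemoveDuplicates Columns:=Array(1, 2), Header:=xlYes
End Sub
```

It could be called once after all results are written, e.g. RemoveDuplicateRows ThisWorkbook.Worksheets("Scraper").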
tl;dr;
This would be a lot easier if :contains / :has css pseudo classes were permitted. Instead, my approach is as follows:
email - find the href attribute whose value starts with mailto
website - check that there is a website icon on the page
Specify a parent of both the website icon and the website address
Loop over all elements matching that parent selector, checking whether each contains the website icon (this is where pseudo-class selectors would have simplified things). If a match is found, we have the shared parent of both the icon and, hopefully, the website address; use the childOfSiblingCssSelector css selector (in this case we are looking at a child of the following div) to extract the website url.
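The email step above reads the link's innerText, which assumes the visible text is the address itself. As a hedged alternative, the address could be taken from the href value instead. The sketch below uses a hypothetical helper, EmailFromMailto, which strips the mailto: prefix and any trailing parameters; it does not decode percent-encoded characters.

```vba
Option Explicit

'Hypothetical helper: returns the address part of a mailto: href,
'or vbNullString when the href is not a mailto link
Public Function EmailFromMailto(ByVal href As String) As String
    Const PREFIX As String = "mailto:"
    Dim addr As String, queryPos As Long
    If LCase$(Left$(href, Len(PREFIX))) <> PREFIX Then Exit Function
    addr = Mid$(href, Len(PREFIX) + 1)
    queryPos = InStr(addr, "?") 'drop ?subject=... and similar parameters
    If queryPos > 0 Then addr = Left$(addr, queryPos - 1)
    EmailFromMailto = Trim$(addr)
End Function
```

For example, EmailFromMailto("mailto:sales@example.com?subject=Hi") returns "sales@example.com".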
Notes:
The entire thing is kept quite high level/generic such that you can adjust your css selectors to hopefully cater for different scenarios. Consequence - may seem a little verbose.
Helper functions are provided to handle element matching. Name these in a way that makes sense for what they are doing. I think some room for improvement here.
Whilst technically the second helper, GetText, could be used to extract the email address (I'd probably add another argument to function call to specify attribute to extract) as well as website address, it seems far quicker, currently, to simply target the appropriate href as detailed above.
I have kept the css selectors as local variables close to their usage; you could have them as constants, closer to top of module, where easier to access perhaps? Unsure without seeing how this performs over time/different urls
Css selectors are chosen over .getElementsBy methods for two reasons: 1) there is browser optimization for css selectors so, if well formulated, css will be faster 2) I want to preserve the flexibility of the code/helper functions - you have far more specificity with css selectors in terms of what patterns you can express. I deemed this important as I don't know what future cases you may need to handle.
I am deliberately not using class name and index e.g. doc.getElementsByClassName("_50f4")(2).innerText, as I am unfamiliar with the range of potential use cases; this feels a little fragile as relies on consistent ordering and numbering of elements (at least up to these indices).
TODO:
Rather than instantiate a new HTMLDocument each time in GetText, it is more efficient to pass another HTMLDocument argument in the function signature i.e. from calling procedure. A re-factor could take that into consideration.
This type of coding might lend itself to being class based in the future. Particularly if error handling is to be added and further functions.
VBA:
Option Explicit
'VBE > Tools > References > Microsoft HTML Object Library
Public Sub test()
Dim ie As Object, ws As Worksheet
Set ie = CreateObject("InternetExplorer.Application")
Set ws = ThisWorkbook.Worksheets("Scraper")
With ie
.Visible = True
.Navigate2 "https://www.facebook.com/pg/SalemFordNH/about/?ref=page_internal%5Blink%5D"
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
Dim email As String, website As String, iconCssSelector As String
'iconCssSelector for website icon in this instance
iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"
If ElementIsPresent(ie.document, "[href^=mailto]") Then
email = ie.document.querySelector("[href^=mailto]").innerText
Else
email = "Not found"
End If
Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent
If ElementIsPresent(ie.document, iconCssSelector) _
And ElementIsPresent(ie.document, sharedParentCssSelector) Then
Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
Else
website = "Not found"
End If
End With
'Assumes headers already present
Dim nextRow As Long
nextRow = GetLastRow(ws, 1) + 1
ws.Cells(nextRow, 1).Resize(1, 2) = Array(email, website)
.Quit
End With
End Sub
Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function
Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
'in this instance and with microsoft IE DOM you cannot select for parent of an element with pseudo class _
of :has(>child); nor use :contains... instead pass expected parent selector, that houses _
both the icon element for website and the website address itself, and loop all matches checking for website icon _
if found use childOfSiblingCssSelector to extract
Dim i As Long, html As HTMLDocument
Set html = New HTMLDocument
For i = 0 To parents.length - 1
html.body.innerHTML = parents.item(i).innerHTML
If ElementIsPresent(html, iconCssSelector) Then
GetText = html.querySelector(childOfSiblingCssSelector).innerText
Exit Function
End If
Next
GetText = "Not found"
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function
Project references (VBE > Tools > References):
Microsoft HTML Object Library
Additional reading:
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll
https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector
Edit:
Example of loop - assumes no empty rows in column A between urls.
Option Explicit
'VBE > Tools > References > Microsoft HTML Object Library
Public Sub test()
Dim ie As Object, ws As Worksheet, wsUrls As Worksheet, urls()
Set ie = CreateObject("InternetExplorer.Application")
Set ws = ThisWorkbook.Worksheets("Scraper")
Set wsUrls = ThisWorkbook.Worksheets("Url List")
With wsUrls
urls = Application.Transpose(.Range("A2:A" & .Cells(.rows.Count, "A").End(xlUp).Row).Value)
End With
Dim results(), r As Long
ReDim results(1 To UBound(urls), 1 To 2)
With ie
.Visible = True
For r = LBound(urls) To UBound(urls)
.Navigate2 urls(r)
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
Dim email As String, website As String, iconCssSelector As String
'iconCssSelector for website icon in this instance
iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"
If ElementIsPresent(ie.document, "[href^=mailto]") Then
email = ie.document.querySelector("[href^=mailto]").innerText
Else
email = "Not found"
End If
Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent
If ElementIsPresent(ie.document, iconCssSelector) _
And ElementIsPresent(ie.document, sharedParentCssSelector) Then
Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
Else
website = "Not found"
End If
End With
'Assumes headers already present
Dim nextRow As Long
results(r, 1) = email
results(r, 2) = website
Next
.Quit
End With
nextRow = GetLastRow(ws, 1) + 1
ws.Cells(nextRow, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function
Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
'in this instance and with microsoft IE DOM you cannot select for parent of an element with pseudo class _
of :has(>child); nor use :contains... instead pass expected parent selector, that houses _
both the icon element for website and the website address itself, and loop all matches checking for website icon _
if found use childOfSiblingCssSelector to extract
Dim i As Long, html As HTMLDocument
Set html = New HTMLDocument
For i = 0 To parents.length - 1
html.body.innerHTML = parents.item(i).innerHTML
If ElementIsPresent(html, iconCssSelector) Then
GetText = html.querySelector(childOfSiblingCssSelector).innerText
Exit Function
End If
Next
GetText = "Not found"
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function

Resources