VBA Code Scraper not placing data in right columns - excel

The code works, but I need it to extract ONLY emails and URLs and place them on the NEXT BLANK ROW of Sheet1 "Scraper":
Emails = Column A
Urls = Column B
Currently it extracts any text, email or URL and places it in column A or B.
I only need emails and URLs. I have been stuck on this for some time and can't seem to work it out.
Also, I am not sure whether my DELETE DUPLICATES step is deleting duplicate rows or duplicates within a column. It SHOULD delete duplicate rows.
How the code works:
On Sheet2 "URL List" I have a list of URLs. The code runs through this list, places the results onto Sheet1 "Scraper", and deletes any duplicates.
It is only supposed to scrape emails and URLs and place them in columns A and B on the NEXT BLANK ROW.
I have tried to fix the problem, but it is beyond me.
Private Sub fbStart_Click()
'Set sheet2 URL List and open Internet Explorer
Dim lr As Long
Dim x As Long
Dim arr() As Variant
Dim wks As Worksheet
Dim ie As Object
Dim dd(1 To 2) As String
Dim Fr As Long
On Error Resume Next
Application.ScreenUpdating = False
Set wks = ThisWorkbook.Sheets("Url List")
With wks
Fr = .Cells(.Rows.Count, 6).End(xlUp).Offset(1).Row
lr = .Cells(.Rows.Count, 1).End(xlUp).Row
.Cells(1, 5).Value = lr
arr = .Range(.Cells(Fr, 1), .Cells(lr, 1)).Value
End With
'Show Internet Explorer and add delay in seconds if needed
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
Application.Wait Now + TimeValue("0:00:0")
For x = LBound(arr, 1) To UBound(arr, 1)
.navigate arr(x, 1)
wtime = Time
Do While .Busy Or .readyState <> 4
DoEvents
'Skip pages with Captchas + write the word Captcha in Sheet 2 Column C
If Time > (wtime + TimeValue("00:00:10")) Then
Cells(x + 1, "C").Value = "Captcha"
Exit Do
End If
Loop
On Error Resume Next
'Variable for document or data which need to be extracted out of webpage, change innertext number if same class used
Dim doc As HTMLDocument
Set doc = ie.document
dd(1) = doc.getElementsByClassName("_50f4")(2).innerText
dd(2) = doc.getElementsByClassName("_50f4")(3).innerText
'Paste the web data into Sheet1 "Scraper" in next BLANK ROW
With Sheet1
.Cells(.Rows.Count, "A").End(xlUp).Offset(1, 0).Resize(, 2).Value = dd
End With
' Put A number 1 in Sheet2 "Url List"column B to notify this URL is done
Sheets("Url List").Range("B" & Rows.Count).End(xlUp).Offset(1, 0).Value = 1
'Deletes duplicates in column A Sheet1
Columns(1).RemoveDuplicates Columns:=Array(1)
Columns(2).RemoveDuplicates Columns:=Array(1)
'Count No1 in sheet2 Column B
With Worksheets("Url List")
Lastrow = .Cells(.Rows.Count, "B").End(xlUp).Row
Sheets("Url List").Range("B1").Value = Lastrow
End With
Call Autoclick_Click
Next x
.Quit
End With
'Hide FaceBook Scraper Form
ScraperForm.Hide
End Sub

The code below shows how to find the email and website address. You already have your loop and de-duplication, so what follows are helper methods for extracting the required info; you can simply assign the variables email and website to your cells inside the loop. I also show a helper function that determines the last row in the target sheet, so that both variables are written to the correct columns in one go.
I can help with integrating this into your loop if needed, but the emphasis here is on how to identify the elements of interest and how to write them out to the correct columns. To be honest, de-duplicating is easily done in the sheet at the end; you can also use the macro recorder to get perfectly functional code for that single step, or use existing SO answers.
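On the de-duplication question, here is a minimal sketch, assuming the results sit in columns A:B of the "Scraper" sheet under a header row. Calling Range.RemoveDuplicates on the two-column range with Columns:=Array(1, 2) removes a row only when both the email and the website repeat. Calling it on one column at a time, as in the question, compares only that column and shifts only that column's cells up, so emails and URLs can end up misaligned. The sub name is made up for illustration.

```vba
Option Explicit

'Hypothetical helper: removes rows where BOTH email (col A) and website (col B) repeat
Public Sub RemoveDuplicateRows()
    Dim ws As Worksheet, lastRow As Long
    Set ws = ThisWorkbook.Worksheets("Scraper")
    'Assumes column A is at least as long as column B
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    If lastRow < 2 Then Exit Sub 'nothing below the assumed header row
    ws.Range("A1:B" & lastRow).RemoveDuplicates Columns:=Array(1, 2), Header:=xlYes
End Sub
```

Running this once after the loop finishes is likely cheaper than de-duplicating after every url.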
tl;dr;
This would be a lot easier if :contains / :has css pseudo classes were permitted. Instead, my approach is as follows:
email - find the href attribute whose value starts with mailto
website - check that there is a website icon on the page
Specify a parent of both the website icon and the website address
Loop all matches to that parent specification checking if contains the website icon (this is where pseudo class selectors would have simplified things). If match found then we have the shared parent of both icon and hopefully website address; use childOfSiblingCssSelector (we are looking at a child of the following div in this case) css selector to extract the website url.
Notes:
The entire thing is kept quite high level/generic such that you can adjust your css selectors to hopefully cater for different scenarios. Consequence - may seem a little verbose.
Helper functions are provided to handle element matching. Name these in a way that makes sense for what they are doing. I think some room for improvement here.
Whilst technically the second helper, GetText, could be used to extract the email address (I'd probably add another argument to function call to specify attribute to extract) as well as website address, it seems far quicker, currently, to simply target the appropriate href as detailed above.
I have kept the css selectors as local variables close to their usage; you could have them as constants, closer to top of module, where easier to access perhaps? Unsure without seeing how this performs over time/different urls
Css selectors are chosen over .getElementsBy methods for two reasons: 1) there is browser optimization for css selectors so, if well formulated, css will be faster 2) I want to preserve the flexibility of the code/helper functions - you have far more specificity with css selectors in terms of what patterns you can express. I deemed this important as I don't know what future cases you may need to handle.
I am deliberately not using class name and index e.g. doc.getElementsByClassName("_50f4")(2).innerText, as I am unfamiliar with the range of potential use cases; this feels a little fragile as relies on consistent ordering and numbering of elements (at least up to these indices).
TODO:
Rather than instantiate a new HTMLDocument each time in GetText, it is more efficient to pass another HTMLDocument argument in the function signature i.e. from calling procedure. A re-factor could take that into consideration.
This type of coding might lend itself to being class based in the future. Particularly if error handling is to be added and further functions.
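The first TODO item could look roughly like the sketch below. It assumes the caller creates one HTMLDocument up front and passes it in, so no new document is instantiated per call. The function name GetTextReusingDoc and the parameter name scratch are invented for this illustration; it also adds a guard for the case where the parent is found but the child selector matches nothing.

```vba
Option Explicit
'VBE > Tools > References > Microsoft HTML Object Library

'Hypothetical variant of GetText that reuses a caller-supplied scratch document
Public Function GetTextReusingDoc(ByVal scratch As HTMLDocument, ByVal parents As Object, _
        ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
    Dim i As Long, node As Object
    GetTextReusingDoc = "Not found" 'default when no parent contains the icon
    For i = 0 To parents.Length - 1
        'load each candidate parent into the scratch document and test it for the icon
        scratch.body.innerHTML = parents.Item(i).innerHTML
        If scratch.querySelectorAll(iconCssSelector).Length > 0 Then
            Set node = scratch.querySelector(childOfSiblingCssSelector)
            If Not node Is Nothing Then GetTextReusingDoc = node.innerText
            Exit Function
        End If
    Next
End Function
```

The calling procedure would declare Dim scratch As HTMLDocument, run Set scratch = New HTMLDocument once before its loop, and pass scratch on every call.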
VBA:
Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
Dim ie As Object, ws As Worksheet
Set ie = CreateObject("InternetExplorer.Application")
Set ws = ThisWorkbook.Worksheets("Scraper")
With ie
.Visible = True
.Navigate2 "https://www.facebook.com/pg/SalemFordNH/about/?ref=page_internal%5Blink%5D"
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
Dim email As String, website As String, iconCssSelector As String
'iconCssSelector for website icon in this instance
iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"
If ElementIsPresent(ie.document, "[href^=mailto]") Then
email = ie.document.querySelector("[href^=mailto]").innerText
Else
email = "Not found"
End If
Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent
If ElementIsPresent(ie.document, iconCssSelector) _
And ElementIsPresent(ie.document, sharedParentCssSelector) Then
Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
Else
website = "Not found"
End If
End With
'Assumes headers already present
Dim nextRow As Long
nextRow = GetLastRow(ws, 1) + 1
ws.Cells(nextRow, 1).Resize(1, 2) = Array(email, website)
.Quit
End With
End Sub
Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function
Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
'in this instance and with microsoft IE DOM you cannot select for parent of an element with pseudo class _
of :has(>child); nor use :contains... instead pass expected parent selector, that houses _
both the icon element for website and the website address itself, and loop all matches checking for website icon _
if found use childOfSiblingCssSelector to extract
Dim i As Long, html As HTMLDocument
Set html = New HTMLDocument
For i = 0 To parents.length - 1
html.body.innerHTML = parents.item(i).innerHTML
If ElementIsPresent(html, iconCssSelector) Then
GetText = html.querySelector(childOfSiblingCssSelector).innerText
Exit Function
End If
Next
GetText = "Not found"
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function
Project references (VBE > Tools > References):
Microsoft HTML Object Library
Additional reading:
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll
https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector
Edit:
Example of loop - assumes no empty rows in column A between urls.
Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
Dim ie As Object, ws As Worksheet, wsUrls As Worksheet, urls()
Set ie = CreateObject("InternetExplorer.Application")
Set ws = ThisWorkbook.Worksheets("Scraper")
Set wsUrls = ThisWorkbook.Worksheets("Url List")
With wsUrls
urls = Application.Transpose(.Range("A2:A" & .Cells(.rows.Count, "A").End(xlUp).Row).Value)
End With
Dim results(), r As Long
ReDim results(1 To UBound(urls), 1 To 2)
With ie
.Visible = True
For r = LBound(urls) To UBound(urls)
.Navigate2 urls(r)
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
Dim email As String, website As String, iconCssSelector As String
'iconCssSelector for website icon in this instance
iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"
If ElementIsPresent(ie.document, "[href^=mailto]") Then
email = ie.document.querySelector("[href^=mailto]").innerText
Else
email = "Not found"
End If
Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent
If ElementIsPresent(ie.document, iconCssSelector) _
And ElementIsPresent(ie.document, sharedParentCssSelector) Then
Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
Else
website = "Not found"
End If
End With
'Assumes headers already present
Dim nextRow As Long
results(r, 1) = email
results(r, 2) = website
Next
.Quit
End With
nextRow = GetLastRow(ws, 1) + 1
ws.Cells(nextRow, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function
Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
'in this instance and with microsoft IE DOM you cannot select for parent of an element with pseudo class _
of :has(>child); nor use :contains... instead pass expected parent selector, that houses _
both the icon element for website and the website address itself, and loop all matches checking for website icon _
if found use childOfSiblingCssSelector to extract
Dim i As Long, html As HTMLDocument
Set html = New HTMLDocument
For i = 0 To parents.length - 1
html.body.innerHTML = parents.item(i).innerHTML
If ElementIsPresent(html, iconCssSelector) Then
GetText = html.querySelector(childOfSiblingCssSelector).innerText
Exit Function
End If
Next
GetText = "Not found"
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function

Related

I have questions about how to click and search on web using vba

I have questions about how to click and search on web using vba.
I have written the code, but cannot work out how to click the button on this web page.
Sub LEISearch()
'dimension (declare or set aside memory for) our variables
Dim objIE As InternetExplorer 'special object variable representing the IE browser
Dim LEI As HTMLLinkElement 'special object variable for an <a> (link) element
Dim y As Integer 'integer variable we'll use as a counter
Dim result As String 'string variable that will hold our result link'
Dim result2 As String
'initiating a new instance of Internet Explorer and asigning it to objIE
Set objIE = New InternetExplorer
'make IE browser visible (False would allow IE to run in the background)
objIE.Visible = True
'navigate IE to this web page (a pretty neat search engine really)
objIE.navigate "https://www.gmeiutility.org/search.jsp?keyWord"
'wait here a few seconds while the browser is busy
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
'in the search box put cell "A2" value, the word "in" and cell "C1" value
objIE.document.getElementById("searchInput").Value = _
Sheets("Macro1").Range("A1").Value
'click the 'go' button
Set LEIButton = objIE.document.getElementsByClassName("hiddenSubmitButton")
LEIButton.Focus
LEIButton.Click
'wait again for the browser
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
End Sub
This case is a really nice and clean example of web scraping, so I will take this opportunity to present an educational post.
I highly recommend avoiding using IE to scrape websites whenever it's possible. It's highly inefficient. Especially in a case like this where there can be multiple pages of results. Instead, you can use HTTP requests.
An HTTP request is a structured way to request something from a server. In this case we want to send a keyword to the server and get the corresponding search results.
To find out what this request should look like, you have to inspect the network traffic when the button with the magnifying glass is clicked. You can do that through your browser's developer tools (Ctrl+Shift+E if you're using Firefox):
If you go through the Headers and the Params of the request you will see what the url, the body and the headers should look like. In this particular case, all the parameters are encoded into the url and the headers are not essential to the success of the request, so all you need is the url.
Some of the parameters of the url are the keyword, the number of results per page and the page number.
The response's payload is in JSON format. You can inspect its structure using a tool like this. Here's what it looks like:
Basically the JSON response consists of as many results as you have specified should be displayed per page (or fewer). To get the next page you need to send a new request with the same keyword but a new page number, and so on.
In fact, as you can see the website offers a lot more data than what's displayed on your browser, which could prove to be useful.
The code below searches for the keyword test, requesting 25 results per page. A first request is sent to find out how many pages of results there are; the code then loops through all pages and writes the results to a worksheet.
TL;DR
Option Explicit
Sub main()
Dim sht As Worksheet
Dim totalNumberOfPages As Long
Dim searchResults As Object
Dim pageNumber As Long
Dim results() As String
Dim entity As Object
Dim i As Long, j As Long
Dim rng As Range
Set sht = ThisWorkbook.Worksheets("Name of your Worksheet")
''''''First request to find out the number of pages''''''
Set searchResults = XHRrequest("test", 25, 1) '
totalNumberOfPages = searchResults("totalPages") '
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'''''''''''''''''''Loop through all the pages''''''''''''''''''''''''''''''''
For pageNumber = 1 To totalNumberOfPages Step 1 '
Set searchResults = XHRrequest("test", 25, pageNumber) '
ReDim results(1 To searchResults("entitySearchResult").Count, 1 To 7) '
i = 0 '
'''''''''''write the results in an array'''''''''''''''''''''''''''' '
For Each entity In searchResults("entitySearchResult") ' '
i = i + 1 ' '
results(i, 1) = entity("LEINumber") ' '
results(i, 2) = entity("legalName") ' '
results(i, 3) = entity("city") ' '
results(i, 4) = entity("headquartersCountry") ' '
results(i, 5) = entity("recordStatus") ' '
results(i, 6) = entity("renewalStatus") ' '
results(i, 7) = entity("entityStatus") ' '
Next entity ' '
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' '
'''''''''''''''write all the results in the worksheet in one go''''' '
With sht ' '
Set rng = .Range("A" & .Rows.Count).End(xlUp).Offset(1, 0) ' '
End With ' '
rng.Resize(UBound(results, 1), UBound(results, 2)) = results ' '
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' '
Next pageNumber '
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
End Sub
Public Function XHRrequest(ByVal key As String, ByVal resultsPerPage As Long, ByVal pageNumber As Long) As Object
Dim req As New WinHttpRequest
Dim url As String
url = "https://www.gmeiutility.org/actions/Search/?isPendingValidationChecked=true&isSearchAllLOUChecked=true&keyWord=" & key & "&page=" & pageNumber & "&resultsPerPage=" & resultsPerPage & "&searchType=baseSearch" 'build the URL according to the parameters
'''''''''Send the HTTP request'''''''''''''''''''''''''''''''
With req '
.Open "POST", url, False '
.send '
Set XHRrequest = JsonConverter.ParseJson(.responseText) '
End With '
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
End Function
For demonstration purposes the code above writes all the data to the worksheet named "Name of your Worksheet".
If you need to perform multiple searches you can easily modify the code to best fit your needs. More specifically you can loop through multiple keywords and call the XHRrequest function using those keywords instead of "test".
Here's a sample of the output:
You will need to add the following references to your project (VBE>Tools>References):
Microsoft WinHTTP Services version 5.1
Microsoft HTML Object Library
Microsoft Scripting Runtime
You will also need to add a JSON parser to your project; the code above calls JsonConverter.ParseJson, which comes from the VBA-JSON module (JsonConverter.bas). Follow its installation instructions and you should be set to go.
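As mentioned above, several searches can be run by looping over keywords. The sketch below is only an illustration: it relies on the XHRrequest function from the listing above (and therefore on the same references and JSON parser), and the keyword list and sub name are placeholders.

```vba
Option Explicit

'Sketch only: relies on the XHRrequest function shown earlier and on JsonConverter
Public Sub SearchSeveralKeywords()
    Dim keywords As Variant, k As Variant, searchResults As Object
    keywords = Array("test", "bank") 'placeholder keywords
    For Each k In keywords
        Set searchResults = XHRrequest(CStr(k), 25, 1) 'first page of each search
        Debug.Print k, searchResults("totalPages")     'how many pages each keyword returns
    Next k
End Sub
```

Keywords containing spaces or special characters would presumably need url-encoding before being concatenated into the request url.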

Is there a way to slow down a Web Scraper so it will pick up the code?

I wrote a macro that goes to WU (Weather Underground) to get historical data and, for the most part, it works. However, I believe the macro is running too fast to pick up the data from the website.
The website is https://www.wunderground.com/history/daily/us/tx/el-paso/KELP/date/2017-1-3 and the table I want has the class tablesaw-sortable.
I have tried the following: DoEvents and Application.Wait (Now + TimeValue("00:00:01")) to try to slow down the process.
Sub BrowseToWU()
Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As MSHTML.HTMLDocument
Dim RowAddress As Integer
Dim WebAddress As String
Dim DateSheet As Date
Dim WkDay As Integer
Dim DateSheetName As String
'Application.ScreenUpdating = False
'Application.StatusBar = True
RowAddress = 2
IE.Visible = True
Do Until RowAddress = 60
WebAddress = Range("A" & RowAddress)
DateSheet = Right(WebAddress, 8)
DateSheetName = Right(WebAddress, 8)
WkDay = Weekday(DateSheet, vbSunday)
If WkDay < 3 Then
RowAddress = RowAddress + 1
ElseIf WkDay > 6 Then
RowAddress = RowAddress + 1
Else
IE.Navigate WebAddress
Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop
Set HTMLDoc = IE.Document
DoEvents
Application.Wait (Now + TimeValue("00:00:05"))
DoEvents
ProcessHTMLPage HTMLDoc
DateSheet = Right(WebAddress, 8)
DoEvents
Application.Wait (Now + TimeValue("00:00:01"))
ActiveSheet.Name = DateSheetName
DoEvents
RowAddress = RowAddress + 1
'IE.Quit
Worksheets("Sheet1").Activate
End If
Loop
End Sub
Option Explicit
Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)
Dim HTMLTable As MSHTML.IHTMLElement
Dim HTMLTables As MSHTML.IHTMLElementCollection
Dim HTMLRow As MSHTML.IHTMLElement
Dim HTMLCell As MSHTML.IHTMLElement
Dim RowNum As Long, ColNum As Integer
'Dim IE As New SHDocVw.InternetExplorer
'Dim Ws As Worksheet
Set HTMLTables = HTMLPage.getElementsByClassName("tablesaw-sortable")
'DoEvents
For Each HTMLTable In HTMLTables
Worksheets.Add
DoEvents
Range("A1").Value = HTMLTable.className
Range("B1").Value = Now
RowNum = 2
For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
'Debug.Print vbTab & HTMLRow.innerText
ColNum = 1
For Each HTMLCell In HTMLRow.Children
Cells(RowNum, ColNum) = HTMLCell.innerText
ColNum = ColNum + 1
Next HTMLCell
RowNum = RowNum + 1
Next HTMLRow
Next HTMLTable
DoEvents
'IE.Quit
End Sub
The macro is supposed to run through sheet1 picking up the web address to the historical data if it satisfies the criteria of being a certain day of the week.
IE will open and then it will kick over to the next module that will take in the data.
A new worksheet is created and the data pasted into the new worksheet.
The worksheet is renamed to the date of the data.
The web address sheet is activated again and the process starts over again.
The error I get is that the data isn't taken from the website, so the For statement ends and the web address sheet is renamed and an error occurs.
One way around this is to call the API that the page is using to get that info.
The API returns json which you can parse with a json parser. I use jsonconverter.bas. After installing the code from that link in a standard module called JsonConverter, go to VBE > Tools > References > Add a reference to Microsoft Scripting Runtime.
Finding the API:
If you press F12 to open developer tools and go to the Network tab and then press F5 to refresh any url of interest you will see the recorded web traffic. You can find the API call there.
See my answer here on how to search the network traffic using a specific observation value you expect to see in the response - this will filter the list of network traffic to those items containing the value of interest. Be sensible in selecting the value - you want something unlikely to occur elsewhere. You can also filter the network traffic to XHR only.
The API response:
The API returns json. More specifically, it returns a dictionary containing 2 keys. The second key, "observations", can be used to return a collection (denoted by []) of dictionaries (denoted by {}).
Each dictionary represents a row of the table (daily observations). You can loop this collection, and then loop the inner dictionaries, to access the table row values and reconstruct the table by populating an array. Explore example json response here.
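The nested loop described above can be sketched as follows. The sample string is made up to mimic the described shape: "metadata" is a placeholder for the first key and "temp" is a hypothetical field, while "valid_time_gmt" is one of the fields the code further down handles. It requires JsonConverter.bas and the Microsoft Scripting Runtime reference.

```vba
Option Explicit

'Sketch: requires JsonConverter.bas and a reference to Microsoft Scripting Runtime
Public Sub WalkObservations()
    Dim sample As String, observations As Object, item As Object, key As Variant
    'Made-up sample; "metadata" and "temp" are placeholders, not confirmed API names
    sample = "{""metadata"":{},""observations"":[{""valid_time_gmt"":1483426800,""temp"":50}]}"
    Set observations = JsonConverter.ParseJson(sample)("observations")
    For Each item In observations      'each item is a dictionary, i.e. one table row
        For Each key In item.Keys      'each key is one column header
            Debug.Print key, item(key)
        Next key
    Next item
End Sub
```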
Explanation of json structure:
Explanation of code:
The code is broken down into a number of helper subs and functions, allocating certain tasks to each, to make the code easier to debug and follow, as well as to align better with Object Oriented Programming principles.
Overall the process is:
1. Gather urls from Worksheets("Sheet1"). Helper function GetAllUrls.
2. Process those urls and retain only the dates which correspond to Tue-Thu. These are kept as strings formatted as "yyyymmdd" so they can be passed to the API later. This is handled by helper functions GetOnlyQualifyingUrlsDates and IncludeThisDate. IncludeThisDate performs the check for whether to include; GetOnlyQualifyingUrlsDates handles the looping and formatting of results.
3. Issue xmlhttp requests by looping over qualifying url dates and concatenating those into the url for the API call, then issuing the request. This is performed by the main sub GetTables.
4. Sheet creation, for output, is handled by helper function CreateWorksheet. This function calls another helper function, SheetExists, to ensure sheets are only created if they don't already exist; otherwise, the existing sheet by that name is used.
5. The resultant json response, from step 3, is passed to a helper sub WriteOutResults which accepts the output sheet object and the json variable as arguments. It extracts all the info from the json response, essentially reconstructing the table, and adds the table and headers to the appropriate sheet. It calls helper function Epoch2Date, which handles the unix timestamp to datetime conversion for the two unix fields in the json object.
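To illustrate the unix timestamp conversion on its own, here is a minimal alternative sketch using the built-in DateAdd function rather than the Epoch2Date helper from the listing. The function and sub names are invented for this example; it assumes the timestamp is in whole seconds and fits in a Long, and the result stays in GMT.

```vba
Option Explicit

'Sketch: convert Unix seconds to a VBA Date (still in GMT) with the built-in DateAdd
Public Function UnixSecondsToDate(ByVal unixSeconds As Long) As Date
    UnixSecondsToDate = DateAdd("s", unixSeconds, #1/1/1970#)
End Function

Public Sub EpochDemo()
    Debug.Print UnixSecondsToDate(0)     '1 January 1970, midnight
    Debug.Print UnixSecondsToDate(86400) 'exactly one day later
End Sub
```

Unlike Epoch2Date, this version does not handle millisecond timestamps.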
TODO:
The API key may be time limited. Add a helper function which returns the current valid key.
The API accepts start date and end date parameters in the url construct. It would be far better to issue one request for the entire range if possible, or chunked ranges e.g. months, to reduce the number of requests made. This would also reduce the likelihood of being blocked. This would mean some additional code would need to be written, before writing out results, to ensure only dates of interest are being written to sheets. Though you could write out all then simply loop all sheets and delete those that aren't wanted (perfectly doable if we are talking about 365 dates total). Personally, I would handle the include date part in the construction of the table from a single request (if possible) that has the min and max dates for entire urls listed passed as start and end date parameters. I would then write a single flat table out to one sheet as this will be much easier for later data analysis.
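If a single request covering the whole range were used, each observation would need its own weekday check before being written out. A minimal sketch of such a check, assuming the timestamp has already been converted to a Date (for example with Epoch2Date) and ignoring any GMT versus local-time shift; the function name is hypothetical.

```vba
Option Explicit

'Sketch: True when the observation falls on Tuesday, Wednesday or Thursday
Public Function IsTueToThu(ByVal observationTime As Date) As Boolean
    Select Case Weekday(observationTime, vbSunday)
        Case vbTuesday, vbWednesday, vbThursday
            IsTueToThu = True
    End Select 'any other weekday leaves the default of False
End Function
```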
VBA:
Option Explicit
Public Sub GetTables()
'VBE > Tools > References > Microsoft Scripting Runtime
Dim json As Object, qualifyingUrlsDates(), urls(), url As String
Dim ws As Worksheet, wsOutput As Worksheet, i As Long, startDate As String, endDate As String
Set ws = ThisWorkbook.Worksheets("Sheet1")
urls = GetAllUrls(2, ws, "A")
qualifyingUrlsDates = GetOnlyQualifyingUrlsDates(urls)
'API key may not remain valid over time so look at obtaining it by a prior request
With CreateObject("MSXML2.XMLHTTP") 'issue xmlhttp request for each valid date (this would be better done using start and enddate to specify entire range _
of batches e.g. months within total range to cut down on requests
For i = LBound(qualifyingUrlsDates) To UBound(qualifyingUrlsDates)
startDate = qualifyingUrlsDates(i)
endDate = startDate ' a little verbose but useful for explaining
url = "https://api.weather.com/v1/geocode/31.76/-106.49/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=" & startDate & "&endDate=" & endDate & "&units=e"
.Open "GET", url, False
.send
Set json = JsonConverter.ParseJson(.responseText)("observations")
Set wsOutput = CreateWorksheet(qualifyingUrlsDates(i))
WriteOutResults wsOutput, json
Next
End With
End Sub
Public Sub WriteOutResults(ByVal wsOutput As Worksheet, ByVal json As Object)
'json is a collection of dictionaries. Each dictionary is a time period reading from the day i.e. one row in output
Dim results(), item As Object, headers(), r As Long, c As Long, key As Variant
headers = json.item(1).keys 'get the headers which are the keys of each dictionary
ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
For Each item In json
r = r + 1: c = 0 'increase row in results array to store results for table row
For Each key In item.keys
c = c + 1 'increase column number in results array for writing out results
Select Case key
Case "valid_time_gmt", "expire_time_gmt" 'convert unix timestamp fields to datetime
results(r, c) = Epoch2Date(item(key))
Case Else
results(r, c) = item(key)
End Select
Next
Next
With wsOutput
.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Public Function GetOnlyQualifyingUrlsDates(ByVal urls As Variant) As Variant
Dim i As Long, output(), counter As Long
ReDim output(1 To UBound(urls))
For i = LBound(urls) To UBound(urls)
If IncludeThisDate(urls(i)) Then 'check if weekday is to be included
counter = counter + 1
output(counter) = Format$(Right$(urls(i), 8), "yyyymmdd") 'if to include then add to output array of urls of interest
End If
Next
ReDim Preserve output(1 To counter)
GetOnlyQualifyingUrlsDates = output
End Function
Public Function IncludeThisDate(ByVal url As String) As Boolean
'tue, wed, thurs are valid
IncludeThisDate = Not IsError(Application.Match(Weekday(Right$(url, 8), vbSunday), Array(3, 4, 5)))
End Function
Public Function SheetExists(ByVal sheetName As String) As Boolean '<== function by #Rory
SheetExists = Evaluate("ISREF('" & sheetName & "'!A1)")
End Function
Public Function GetAllUrls(ByVal startRow As Long, ByVal ws As Worksheet, ByVal columnName As String) As Variant
'transpose used based on premise no more than a couple of years of dates
'startRow is start row for urls, ws is sheet where urls found, columnName is string representation of column for urls e.g. "A"
With ws
GetAllUrls = Application.Transpose(.Range(columnName & startRow & ":" & columnName & .Cells(.rows.Count, columnName).End(xlUp).Row).Value) 'use columnName rather than a hard-coded "A"
End With
End Function
Public Function CreateWorksheet(ByVal sheetName As String) As Worksheet
Dim ws As Worksheet
If SheetExists(sheetName) Then
Set ws = ThisWorkbook.Worksheets(sheetName)
'do something.... clear it? Then add new data to it?
Else
Set ws = ThisWorkbook.Worksheets.Add
ws.Name = sheetName
End If
Set CreateWorksheet = ws
End Function
Public Function Epoch2Date(ByVal E As Currency, Optional msFrac) As Date '# Schmidt http://www.vbforums.com/showthread.php?805245-EPOCH-to-Date-and-vice-versa
Const Estart As Double = #1/1/1970#
msFrac = 0
If E > 10000000000# Then E = E * 0.001: msFrac = E - Int(E)
Epoch2Date = Estart + (E - msFrac) / 86400
End Function

How to scrape data from Bloomberg's website with VBA

Background
Disclaimer: I am a beginner, please bear with my (most plausibly wrong) code.
I want to update currency pairs' value (PREV CLOSE) with a button-enabled-VBA macro. My Excel worksheet contains FX pairs (e.g. USDGBP) on column G:G which are then used to run a FOR loop for every pair in the column.
The value would then be stored in column I:I
Right now, the problem according to the Debugger lies in one line of code that I will highlight below
Sources
I got some inspiration from https://www.youtube.com/watch?v=JxmRjh-S2Ms&t=1050s - notably 17:34 onwards - but I want my code to work for multiple websites at the press of a button.
I have tried the following code
Public Sub Auto_FX_update_BMG()
Application.ScreenUpdating = False 'My computer is not very fast, thus I use this line of
'code to save some computing power and time
Dim internet_object As InternetExplorer
Dim i As Integer
For i = 3 To Sheets(1).Cells(3, 7).End(xlDown).Row
FX_Pair = Sheets(1).Cells(i, 7)
Set internet_object = New InternetExplorer
internet_object.Visible = True
internet_object.navigate "https://www.bloomberg.com/quote/" & FX_Pair & ":CUR"
Application.Wait Now + TimeValue("00:00:05")
internet_object.document.getElementsByClassName("class")(0).getElementsByTagName ("value__b93f12ea") '--> DEBUGGER PROBLEM
'My goal here is to "grab" the PREV CLOSE
'value from the website
With ActiveSheet
.Range(Cells(i, 9)).Value = HTML_element.Children(0).textContent
End With
Sheets(1).Range(Cells(i, 9)).Copy 'Not sure if these 2 lines are unnecesary
ActiveSheet.Paste
Next i
Application.ScreenUpdating = True
End Sub
Expected Result
WHEN I enter "USDGBP" on a cell on column G:G, the macro would go to https://www.bloomberg.com/quote/EURGBP:CUR and "grab" the PREV CLOSE value of 0.8732 (using today's value) and insert it in the respective row of column I:I
As of now, I am just facing the debugger without much idea on how to solve the problem.
You can use class selectors in a loop. The pattern
.previousclosingpriceonetradingdayago .value__b93f12ea
specifies to get child elements with class value__b93f12ea having parent with class previousclosingpriceonetradingdayago. The "." in front is a css class selector and is a faster way of selecting as modern browsers are optimized for css. The space between the two classes is a descendant combinator. querySelector returns the first match for this pattern from the webpage html document.
You can see the parent-child relationship and the classes in the page HTML:
<section class="dataBox previousclosingpriceonetradingdayago numeric">
<header class="title__49417cb9"><span>Prev Close</span></header>
<div class="value__b93f12ea">0.8732</div>
</section>
N.B. If you are a Bloomberg customer look into their APIs. Additionally, it is very likely you can get this same info from other dedicated APIs which will allow for much faster and more reliable xhr requests.
VBA (Internet Explorer):
Option Explicit
Public Sub test()
Dim pairs(), ws As Worksheet, i As Long, ie As Object
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set ie = CreateObject("InternetExplorer.Application")
With ws
pairs = Application.Transpose(.Range("G2:G" & .Cells(.rows.Count, "G").End(xlUp).Row).Value) ' assumes pairs start in row 2
End With
Dim results()
ReDim results(1 To UBound(pairs))
With ie
.Visible = True
For i = LBound(pairs) To UBound(pairs)
.Navigate2 "https://www.bloomberg.com/quote/" & pairs(i) & ":CUR", False
While .Busy Or .readyState < 4: DoEvents: Wend
results(i) = .document.querySelector(".previousclosingpriceonetradingdayago .value__b93f12ea").innerText
Next
.Quit
End With
ws.Cells(2, "I").Resize(UBound(results), 1) = Application.Transpose(results)
End Sub
For a very limited number of requests (more will get you blocked) you could use an xhr request and regex out the value. I assume the pairs are in Sheet1 and start from G2. I also assume there are no empty cells or invalid pairs in column G up to and including the last pair to search for. Otherwise, you will need to develop the code to handle this.
You can try the regex here: https://regex101.com/r/OAyq30/1
Option Explicit
Public Sub test()
Dim re As Object, pairs(), ws As Worksheet, i As Long, s As String
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set re = CreateObject("VBScript.RegExp")
With ws
pairs = Application.Transpose(.Range("G2:G" & .Cells(.rows.Count, "G").End(xlUp).Row).Value) ' assumes pairs start in row 2
End With
Dim results()
ReDim results(1 To UBound(pairs))
With CreateObject("MSXML2.XMLHTTP")
For i = LBound(pairs) To UBound(pairs)
.Open "GET", "https://www.bloomberg.com/quote/" & pairs(i) & ":CUR", False
.send
s = .responseText
results(i) = GetCloseValue(re, s, "previousClosingPriceOneTradingDayAgo%22%3A(.*?)%2")
Next
End With
ws.Cells(2, "I").Resize(UBound(results), 1) = Application.Transpose(results)
End Sub
Public Function GetCloseValue(ByVal re As Object, inputString As String, ByVal pattern As String) As String 'https://regex101.com/r/OAyq30/1
With re
.Global = True
.MultiLine = True
.IgnoreCase = False
.pattern = pattern
If .test(inputString) Then
GetCloseValue = .Execute(inputString)(0).SubMatches(0)
Else
GetCloseValue = "Not found"
End If
End With
End Function
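Both versions above assume every cell in column G holds a valid pair. A minimal sketch of one way to relax that assumption is to check each entry before making a request. `IsValidPair` is a hypothetical helper name, and the "exactly six letters" rule is an assumption about what a pair looks like in this sheet, not something Bloomberg specifies:

```vba
Option Explicit

' Hypothetical helper: True when the text is exactly six letters A-Z,
' e.g. "EURGBP". The six-letter rule is an assumption for illustration.
Public Function IsValidPair(ByVal candidate As String) As Boolean
    IsValidPair = UCase$(Trim$(candidate)) Like "[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z]"
End Function

' Inside either loop above, the request could be guarded like this:
'
'   If IsValidPair(CStr(pairs(i))) Then
'       ' ... navigate / send and read the value as before ...
'   Else
'       results(i) = "Invalid pair"
'   End If
```

Blank cells fail the check because an empty string does not match the six-character pattern. Cells containing Excel error values would still need separate handling, since `CStr` raises an error on them.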
Try the code below.
Before running it, make sure to add two references by going to Tools > References and ticking Microsoft HTML Object Library and Microsoft Internet Controls.
This code works with your example.
Sub getPrevCloseValue()
Dim ie As Object
Dim mySh As Worksheet
Set mySh = ThisWorkbook.Sheets("Sheet1")
Dim colG_Value As String
Dim prev_value As String
Dim a As Long, sect As Object 'loop counter and current <section> element
For a = 3 To mySh.Range("G" & Rows.Count).End(xlUp).Row
colG_Value = mySh.Range("G" & a).Value
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
ie.navigate "https://www.bloomberg.com/quote/" & colG_Value & ":CUR"
Do While ie.Busy: DoEvents: Loop
Do Until ie.readyState = 4: DoEvents: Loop
'Application.Wait (Now + TimeValue("00:00:03")) 'activate if having problem with delay
For Each sect In ie.document.getElementsByTagName("section")
If sect.className = "dataBox previousclosingpriceonetradingdayago numeric" Then
prev_value = sect.getElementsByTagName("div")(0).innerText
mySh.Range("I" & a).Value = prev_value
Exit For
End If
Next sect
ie.Quit 'close this browser instance before moving to the next pair
Next a
End Sub
I have a video tutorial on basic web automation using VBA, which includes web data scraping and other commands. Please check the link below:
https://www.youtube.com/watch?v=jejwXID4OH4&t=700s

Excel Pulling multiple Tables From a Website

I am working on a project to run some analytical models on NFL player stats. I have some code below that another user passed along to me. This code takes a list of links that I have on Sheet1, which is named "PlayerList", and creates a new tab for each player and pulls in their passing stats. All of the links are to Pro Football Reference. I am able to change this code to pull all necessary data for all positions other than quarterback. For the QBs I want to pull the passing stats table as well as the rushing and receiving stats table. Any help would be greatly appreciated. For reference, here are a few sample links:
https://www.pro-football-reference.com/players/R/RodgAa00.htm
https://www.pro-football-reference.com/players/B/BreeDr00.htm
Below is the code:
Option Explicit
Public Sub GetInfo()
Dim html As New HTMLDocument, links(), link As Long, wsSourceSheet As Worksheet
Dim hTable As HTMLTable, ws As Worksheet, playerName As String
Set wsSourceSheet = ThisWorkbook.Worksheets("PlayerList")
Application.ScreenUpdating = False
With wsSourceSheet
links = .Range("C2:C" & .Cells(.Rows.Count, "C").End(xlUp).Row).Value
End With
For link = LBound(links, 1) To UBound(links, 1)
If InStr(links(link, 1), "https://") > 0 Then
Set html = GetHTMLDoc(links(link, 1))
Set hTable = html.getElementById("passing")
If Not hTable Is Nothing Then
playerName = GetNameAbbr(links(link, 1))
Set ws = AddPlayerSheet(playerName)
WriteTableToSheet hTable, ws
FixTable ws
End If
End If
Next
Application.ScreenUpdating = True
End Sub
Public Function GetHTMLDoc(ByVal url As String) As HTMLDocument
Dim sResponse As String, html As New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", url, False
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
html.body.innerHTML = sResponse
Set GetHTMLDoc = html
End Function
Public Sub WriteTableToSheet(ByVal hTable As HTMLTable, ByVal ws As Worksheet)
Dim x As Long, y As Long
With hTable
For x = 0 To .Rows.Length - 1
For y = 0 To .Rows(x).Cells.Length - 1
If y = 6 Or y = 7 Then
ws.Cells(x + 4, y + 1).Value = Chr$(39) & .Rows(x).Cells(y).innerText
Else
ws.Cells(x + 4, y + 1).Value = .Rows(x).Cells(y).innerText
End If
Next y
Next x
End With
End Sub
Public Function GetNameAbbr(ByVal url As String)
Dim tempArr() As String
tempArr = Split(url, "/")
GetNameAbbr = Left$(tempArr(UBound(tempArr)), 6)
End Function
Public Function AddPlayerSheet(ByVal playerName As String) As Worksheet
Dim ws As Worksheet
If SheetExists(playerName) Then
Application.DisplayAlerts = False
ThisWorkbook.Worksheets(playerName).Delete
Application.DisplayAlerts = True
End If
Set ws = ThisWorkbook.Worksheets.Add
ws.Name = playerName
Set AddPlayerSheet = ws
End Function
Public Function SheetExists(ByVal playerName As String) As Boolean
SheetExists = Evaluate("ISREF('" & playerName & "'!A1)")
End Function
Public Sub FixTable(ByVal ws As Worksheet)
Dim found As Range, numSummaryRows As Long
With ws
Set found = .Columns("A").Find("Career")
If found Is Nothing Then Exit Sub
numSummaryRows = .Cells(.Rows.Count, "A").End(xlUp).Row - found.Row
numSummaryRows = IIf(numSummaryRows = 0, 1, numSummaryRows + 1)
Debug.Print found.Offset(, 1).Resize(numSummaryRows, 30).Address, ws.Name
found.Offset(, 1).Resize(numSummaryRows, 30).Copy found.Offset(, 2)
found.Offset(, 1).Resize(numSummaryRows, 1).ClearContents
End With
End Sub
Is there a reason you need to do this with VBA? Excel is quite capable of importing well-organized data such as the [several] tables on that page.
Under the Data tab, click From Web and then enter the Website URL.
Next you will choose the table(s) that you want. Don't go nuts - only get what you need, but you can choose more than one table by enabling the checkbox.
It can take a few long minutes to parse and organize all the data on the page...
Once you're back at the worksheet you'll see the queries on the right side. Right-click a query and choose Load To..., then choose Table and a location for the table data. There are a ton of other properties that you can customize; there are tutorials describing what you can do.
More things to customize are hidden in two ribbon tabs that only appear when you click on a table, Design and Query.
I think there's also a way to just create a list of players and then to use the Advanced option when entering the URL to allow you to dynamically choose any player you want, while only adding the tables once... but I've never quite figured that part out yet.
I'm not a sports fan, but I assume the data will be changing throughout the season, and an advantage of using tables like this is that once you set up your worksheet how you want it, there are settings you can choose to auto-update every time you open the workbook, or on schedule, or manually, or never; whatever is appropriate.
Google "Excel web query" to find out more about the plethora of options available to you when using queries (aka: "Get & Transform") to extract and organize your data.
Perhaps this could be an alternative to consider instead of coding functionality that's already built-in to Excel.
Good luck, and "Go Sports!"
Yes there is a reason for doing this with VBA. In fact at least five.....
You don't have to set it up manually for all the links, which, if you have a very long list, means you would end up having to turn to automation anyway;
On a related theme, powerquery has limitations on how many connections it can support and with NFL player lists you can easily go way beyond what is supported and end up, even when at the max number of connections allowed, with a workbook that crashes or grinds to a halt (I have been there!);
Both tables are not always present so the below has error handling to deal with that;
You get your player named sheets as before, and again error handling for if sheet already present;
Not all versions of powerquery have the nice interface which will allow you to select all the tables individually for these pages. My version of Excel 2016 basically offers only to select the entire page. In that case you have more data than you need and a slowed down process.
Whilst there may be ways to handle this with inbuilt tools, I love me a bit of powerquery, it is no longer "out of the box", but requires knowing how to code in M to some extent and/or reverting to using some VBA anyway.
If you tie this to a button on a sheet you can easily press to refresh when you want, link it to a workbook_open event to refresh on opening, even have windows scheduler open the workbook and refresh at certain times (just so you know VBA still got your back! Though maybe with a little help from my friends ♫ aka Windows).
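A minimal sketch of the refresh-on-open idea just mentioned, assuming the scraping routine is the Public Sub `GetInfo` from this thread and that it lives in a standard module. The event handler itself has to be placed in the ThisWorkbook module:

```vba
Option Explicit

' Place this in the ThisWorkbook module, not in a standard module.
' Excel raises Workbook_Open once, when the workbook is opened
' with macros enabled.
Private Sub Workbook_Open()
    ' Assumes a Public Sub named GetInfo exists in a standard module
    GetInfo
End Sub
```

For the button variant no extra code is needed: a Forms button on the sheet can simply have `GetInfo` assigned as its macro.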
It seems XHR is just a little too fast for the lower tables on each page, but do not despair, you could use Internet Explorer, with a short delay to ensure the Rushing & Receiving table is populated, or, as I have, use Selenium to automate the browser (I have used Chrome but Internet Explorer is possible). Although this is slower than XHR, we can be a little more efficient by running a headless browser instance.
Here is VBA which creates the different tabs as it goes and selects only the tables required. It is based on links starting at C2 on Sheet1.
Option Explicit
Public Sub GetInfo()
Dim d As New ChromeDriver
Dim html As New HTMLDocument, links(), link As Long, wsSourceSheet As Worksheet, clipboard As Object
Dim hTablePass As HTMLTable, hTableRushReceive As HTMLTable, ws As Worksheet, playerName As String
Set wsSourceSheet = ThisWorkbook.Worksheets("Sheet1") '<change to sheet containing links
Application.ScreenUpdating = False
With wsSourceSheet
If .Cells(.Rows.Count, "C").End(xlUp).Row = 2 Then
ReDim links(1 To 1, 1 To 1): links(1, 1) = .Range("C2")
Else
links = .Range("C2:C" & .Cells(.Rows.Count, "C").End(xlUp).Row).Value
End If
End With
For link = LBound(links, 1) To UBound(links, 1)
If InStr(links(link, 1), "https://") > 0 Then
With d
.AddArgument "--headless"
.get links(link, 1)
html.body.innerHTML = .PageSource
Set hTablePass = html.querySelector("#all_passing #passing")
Set hTableRushReceive = html.querySelector("#all_rushing_and_receiving #rushing_and_receiving")
playerName = GetNameAbbr(links(link, 1))
Set ws = AddPlayerSheet(playerName)
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
If Not hTablePass Is Nothing Then
clipboard.SetText Replace$(Replace$(hTablePass.outerHTML, "--></DIV>", vbNullString), "<!--", vbNullString)
clipboard.PutInClipboard
ws.Cells(GetLastRow(ws, 1), 1).PasteSpecial
End If
If Not hTableRushReceive Is Nothing Then
clipboard.SetText hTableRushReceive.outerHTML
clipboard.PutInClipboard
ws.Cells(GetLastRow(ws, 1) + 2, 1).PasteSpecial
End If
End With
End If
Next
d.Quit 'close the headless browser and its driver process once all links are done
Application.ScreenUpdating = True
End Sub
Public Function GetNameAbbr(ByVal url As String) As String
Dim tempArr() As String
tempArr = Split(url, "/")
GetNameAbbr = Left$(tempArr(UBound(tempArr)), 6)
End Function
Public Function AddPlayerSheet(ByVal playerName As String) As Worksheet
Dim ws As Worksheet
If SheetExists(playerName) Then
Application.DisplayAlerts = False
ThisWorkbook.Worksheets(playerName).Delete
Application.DisplayAlerts = True
End If
Set ws = ThisWorkbook.Worksheets.Add
ws.Name = playerName
Set AddPlayerSheet = ws
End Function
Public Function SheetExists(ByVal playerName As String) As Boolean '<== *#Rory
SheetExists = Evaluate("ISREF('" & playerName & "'!A1)")
End Function
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
End With
End Function
References:
Microsoft HTML Object Library
Selenium Type Library
Selenium basic download:
https://github.com/florentbr/SeleniumBasic
*Function adapted from #Rory
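For the Internet Explorer alternative mentioned above, a fixed delay can be replaced by polling until the table shows up. A minimal sketch, assuming a late-bound IE document object is passed in; `WaitForElement` is a hypothetical helper name, and the 10-second default timeout is an arbitrary choice:

```vba
Option Explicit

' Hypothetical helper: polls doc.querySelector until it returns an element
' or timeoutSeconds have elapsed. Returns Nothing if the element never appears.
Public Function WaitForElement(ByVal doc As Object, ByVal cssSelector As String, _
                               Optional ByVal timeoutSeconds As Long = 10) As Object
    Dim deadline As Date, found As Object
    deadline = DateAdd("s", timeoutSeconds, Now)
    Do
        Set found = doc.querySelector(cssSelector)
        If Not found Is Nothing Then Exit Do
        DoEvents ' keep Excel responsive while waiting
    Loop While Now < deadline
    Set WaitForElement = found
End Function
```

It could be called as `Set hTableRushReceive = WaitForElement(ie.document, "#rushing_and_receiving")`, followed by the same `Is Nothing` check used in the code above. Whether the table is populated within the timeout in Internet Explorer is an assumption to verify against the actual pages.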

Scraping data from a website with a dynamic array function - Excel VBA

I want to eventually create a function where I can specify a web page element and URL and populate all instances of that element down a column. But am currently only experiencing limited success with this function:
Sub GrabAnchorTags() '(URL As String) As Variant'
Dim objIE As InternetExplorer
Dim elem As Object
Set objIE = New InternetExplorer
objIE.Visible = False
objIE.navigate "http://example.com/"
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
Dim aRange As Range
Debug.Print objIE.document.getElementsByTagName("a").Length
For Each elem In objIE.document.getElementsByTagName("a")
Debug.Print elem
ActiveCell.Offset(x, y).Value = elem
ActiveCell.Offset(x, y + 1).Value = elem.textContent
x = x + 1
Next
objIE.Quit
Set objIE = Nothing
End Sub
I would like to be able to turn this successfully from a macro to a function.
Currently, it uses a for loop to populate the cells and I wonder if it's possible to accomplish the same thing using evaluate or something similar because the for loop is inefficient.
This function would need to live in a cell, reference a URL in another cell, and populate the cells below it with all elements of a type found on the page. I am currently working on the anchor tag.
Many other solutions I referenced used macros:
Scraping data from website using excel vba
Getting links url from a webpage excel vba
VBA – Web scraping with getElementsByTagName()
Generally speaking, whenever you have many cells to write to, you should enter the data into an internal array, and then write the entire array to the worksheet in one hit. However you seem to not want a macro/sub in your case.
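A minimal sketch of that array-then-write pattern, with made-up placeholder values standing in for scraped data (the sub name and the sample strings are illustrative only):

```vba
Option Explicit

Public Sub WriteArrayInOneHit()
    ' Collect everything in memory first (3 rows x 2 columns of placeholders)
    Dim data(1 To 3, 1 To 2) As Variant
    Dim r As Long
    For r = 1 To 3
        data(r, 1) = "link text " & r
        data(r, 2) = "http://example.com/" & r
    Next r
    ' One write to the worksheet instead of one write per cell
    ActiveCell.Resize(UBound(data, 1), UBound(data, 2)).Value = data
End Sub
```

The loop still exists, but it only touches the in-memory array; the worksheet is written to once, which is where most of the time is usually saved.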
If you wish to take the worksheet formula approach for usability reasons, then the best way is to use a very powerful but underused technique in Excel development.
A NAMED RANGE
Named ranges are Excel's closest thing to an in-memory block of data; other, simpler formulas can then use the Named Range to get information from it.
A Named Range doesn't have to be a simple block of cells on a sheet. You can write your VBA code as a Public function, and then reference it in the Named Range.
Function getElems(url As String, tagName As String) As String()
Dim browser As New MSXML2.XMLHTTP60
Dim doc As MSHTML.HTMLDocument
With browser
.Open "GET", url, False
.send
If .readyState = 4 And .Status = 200 Then
Set doc = New MSHTML.HTMLDocument
doc.body.innerHTML = .responseText
Else
MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
vbNewLine & "HTTP request status: " & .Status
End If
End With
Dim tag As MSHTML.IHTMLElement
Dim tags As MSHTML.IHTMLElementCollection
Set tags = doc.getElementsByTagName(tagName)
Dim arr() As String
Dim arrCounter As Long: arrCounter = 1
ReDim arr(1 To tags.Length, 1 To 2)
For Each tag In tags
arr(arrCounter, 1) = tag.innerText
'Change the below if block to suit
If tagName = "a" Then
arr(arrCounter, 2) = tag.href
Else
arr(arrCounter, 2) = tag.innerText
End If
arrCounter = arrCounter + 1
Next tag
Set doc = Nothing
Set browser = Nothing
getElems = arr
End Function
Now set a Named Range in Excel such as:
elementData
=getElems(Sheet1!$A$1, Sheet1!$B$1)
In A1, put the URL, and in B1 put the tag Name such as "a"
Then in your cells you can say
=INDEX(elementData, ROW(1:1), 1) and in adjacent cell put =INDEX(elementData, ROW(1:1), 2) (or use ROWS formula technique)
and drag down.

Resources