How to scrape location names from a website using Excel VBA

I am trying to scrape the pickup branch locations from a car rental website's home page. The idea is to see exactly where pickup branches exist for a given company.
I had this working before, but the company recently revamped their website and now my code doesn't work. The branch locations seem to be hidden within a form of some sort; they only become visible in the HTML once you click on the pickup location field.
My current code is below:
Option Explicit

Private Sub pickuplocations()
    Dim html As Object
    Dim ws As Worksheet
    Dim headers()
    Dim i As Long
    Dim r As Long
    Dim c As Long
    Dim numrows As Long

    Set ws = ThisWorkbook.Worksheets("Europcar Branches(2)")
    Set html = New HTMLDocument

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.europcar.co.za", False
        .send
        html.body.innerHTML = .responseText 'fetches all html from the website

        Dim pickupbranches As Object
        Dim pickupbranchresults()
        Set pickupbranches = html.getElementById("_location-search-widget_15").getElementsByTagName("span")

        headers = Array("Pickup Location", "Option value") 'for the ws
        numrows = pickupbranches.Length - 1 'sets the row length
        ReDim pickupbranchresults(1 To numrows, 1 To 2) 'sets array size for the results

        For i = 1 To numrows
            pickupbranchresults(i, 1) = pickupbranches.Item(i).innerText
            pickupbranchresults(i, 2) = pickupbranches.Item(i).Value
        Next

        With ws
            .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers 'sets the column headers
            .Cells(2, 1).Resize(UBound(pickupbranchresults, 1), UBound(pickupbranchresults, 2)) = pickupbranchresults
        End With
    End With
End Sub

Your current code requests the page's source HTML and attempts to scrape it.
However, as explained in the comments, the list of locations is loaded dynamically when you click on the search bar; it is not part of the page's source HTML. For this reason your code will yield no results.
It makes more sense to scrape the page dedicated to the locations:
https://www.europcar.co.za/rental-locations/
Now, if you navigate to this page and inspect the network traffic in your browser's developer tools (F12) while the page loads, you will see that an XHR request is being sent.
If you go through the Headers and the Params of the request you will see what the url, the body and the headers should look like. In this particular case, there are no parameters and the headers are not essential to the success of the request, so all you need is the url.
The response's payload is in JSON format. You can inspect its structure using a tool like this.
Basically, the JSON consists of the different countries; each country consists of provinces, and each province consists of the corresponding branches. Each branch carries all the corresponding info.
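To make that concrete, here is a trimmed sketch of what the payload might look like (key names taken from the code below; the values and the number of entries are purely illustrative):
[
    {
        "name": "South Africa",
        "provinces": [
            {
                "name": "Gauteng",
                "branches": [
                    {
                        "name": "Some branch",
                        "emailAddress": "branch@example.com",
                        "contactNumber": "+27 ..."
                    },
                    ...
                ]
            },
            ...
        ]
    }
]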
To parse a response like that you need a JSON parser (look at the end of this post).
TL;DR
Here's how the code could look:
Option Explicit

Sub getLocations()
    Dim req As New WinHttpRequest
    Dim url As String, results() As String
    Dim sht As Worksheet
    Dim responseJSON As Object, country As Object, province As Object, branch As Object
    Dim i As Long
    Dim rng As Range

    Set sht = ThisWorkbook.Worksheets("Name of your Worksheet")
    url = "https://www.europcar.co.za/api/rentalLocations/impressLocations"

    With req
        .Open "GET", url, False
        .send
        Set responseJSON = JsonConverter.ParseJson(.responseText)
    End With

    For Each country In responseJSON
        For Each province In country("provinces")
            i = 0
            ReDim results(1 To province("branches").Count, 1 To 5)
            For Each branch In province("branches")
                i = i + 1
                results(i, 1) = country("name")
                results(i, 2) = province("name")
                results(i, 3) = branch("name")
                results(i, 4) = branch("emailAddress")
                results(i, 5) = branch("contactNumber")
            Next branch
            With sht
                Set rng = .Range("A" & .Rows.Count).End(xlUp).Offset(1, 0)
            End With
            rng.Resize(UBound(results, 1), UBound(results, 2)) = results
        Next province
    Next country
End Sub
For demonstration purposes, the code above writes the results to the worksheet one province at a time, appending each block of branches below the previous one.
With the JSON structure and the sample code I provided in mind, you can easily modify it to fit your needs.
For the code to work you will need to add the following references to your project (VBE>Tools>References):
1. Microsoft WinHTTP Services version 5.1
2. Microsoft Scripting Runtime
You will also need to add this JSON parser to your project. Follow the installation instructions in the link and you should be set to go.

Related

I have questions about how to click and search on web using vba

I have questions about how to click and search on the web using VBA.
I have written the code below, but cannot find a way to click the button on this web page.
Sub LEISearch()
    'dimension (declare or set aside memory for) our variables
    Dim objIE As InternetExplorer 'special object variable representing the IE browser
    Dim LEI As HTMLLinkElement 'special object variable for an <a> (link) element
    Dim y As Integer 'integer variable we'll use as a counter
    Dim result As String 'string variable that will hold our result link
    Dim result2 As String

    'initiate a new instance of Internet Explorer and assign it to objIE
    Set objIE = New InternetExplorer

    'make IE browser visible (False would allow IE to run in the background)
    objIE.Visible = True

    'navigate IE to this web page (a pretty neat search engine really)
    objIE.navigate "https://www.gmeiutility.org/search.jsp?keyWord"

    'wait here a few seconds while the browser is busy
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop

    'in the search box put cell "A2" value, the word "in" and cell "C1" value
    objIE.document.getElementById("searchInput").Value = _
        Sheets("Macro1").Range("A1").Value

    'click the 'go' button
    Set LEIButton = objIE.document.getElementsByClassName("hiddenSubmitButton")
    LEIButton.Focus
    LEIButton.Click

    'wait again for the browser
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
End Sub
This case is a really nice and clean example of web scraping, so I will take this opportunity to present an educational post.
I highly recommend avoiding IE for scraping websites whenever possible. It's highly inefficient, especially in a case like this where there can be multiple pages of results. Instead, you can use HTTP requests.
An HTTP request is a structured way to request something from a server. In this case we want to send a keyword to the server and get the corresponding search results back.
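As a minimal, generic sketch (using the Microsoft WinHTTP Services 5.1 reference mentioned at the end of this post, and a placeholder url), sending such a request from VBA and reading the raw response looks like this:
Dim req As New WinHttpRequest
req.Open "GET", "https://example.com/api/endpoint", False 'False = synchronous: wait for the response
req.send
Debug.Print req.responseText 'the server's raw response, e.g. JSON text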
To find out what this request should look like, you have to inspect the network traffic when the button with the magnifying glass is clicked. You can do that through your browser's developer tools (Ctrl+Shift+E if you're using Firefox).
If you go through the Headers and the Params of the request you will see what the url, the body and the headers should look like. In this particular case, all the parameters are encoded into the url and the headers are not essential to the success of the request, so all you need is the url.
Some of the url's parameters are the keyword, the number of results per page and the page number.
The response's payload is in JSON format. You can inspect its structure using a tool like this.
Basically, the JSON response consists of as many results as you have specified should be displayed per page (or fewer). To get the next page you need to send a new request with the same keyword but a new page number, and so on.
In fact, as you can see, the website offers a lot more data than what's displayed in your browser, which could prove useful.
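Based on the keys accessed in the code below, a trimmed sketch of the response's shape looks roughly like this (other fields omitted, values illustrative):
{
    "totalPages": 12,
    "entitySearchResult": [
        {
            "LEINumber": "...",
            "legalName": "...",
            "city": "...",
            "headquartersCountry": "...",
            "recordStatus": "...",
            "renewalStatus": "...",
            "entityStatus": "..."
        },
        ...
    ]
}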
The code below searches for the keyword test, requesting 25 results per page. A first request is sent to find out how many pages of results there are, and then the code loops through all the pages and prints the results in a worksheet.
TL;DR
Option Explicit

Sub main()
    Dim sht As Worksheet
    Dim totalNumberOfPages As Long
    Dim searchResults As Object
    Dim pageNumber As Long
    Dim results() As String
    Dim entity As Object
    Dim i As Long, j As Long
    Dim rng As Range

    Set sht = ThisWorkbook.Worksheets("Name of your Worksheet")

    'first request to find out the number of pages
    Set searchResults = XHRrequest("test", 25, 1)
    totalNumberOfPages = searchResults("totalPages")

    'loop through all the pages
    For pageNumber = 1 To totalNumberOfPages Step 1
        Set searchResults = XHRrequest("test", 25, pageNumber)
        ReDim results(1 To searchResults("entitySearchResult").Count, 1 To 7)
        i = 0
        'write the results in an array
        For Each entity In searchResults("entitySearchResult")
            i = i + 1
            results(i, 1) = entity("LEINumber")
            results(i, 2) = entity("legalName")
            results(i, 3) = entity("city")
            results(i, 4) = entity("headquartersCountry")
            results(i, 5) = entity("recordStatus")
            results(i, 6) = entity("renewalStatus")
            results(i, 7) = entity("entityStatus")
        Next entity
        'write all the results to the worksheet in one go
        With sht
            Set rng = .Range("A" & .Rows.Count).End(xlUp).Offset(1, 0)
        End With
        rng.Resize(UBound(results, 1), UBound(results, 2)) = results
    Next pageNumber
End Sub

Public Function XHRrequest(ByVal key As String, ByVal resultsPerPage As Long, ByVal pageNumber As Long) As Object
    Dim req As New WinHttpRequest
    Dim url As String
    url = "https://www.gmeiutility.org/actions/Search/?isPendingValidationChecked=true&isSearchAllLOUChecked=true&keyWord=" & key & "&page=" & pageNumber & "&resultsPerPage=" & resultsPerPage & "&searchType=baseSearch" 'build the url according to the parameters

    'send the HTTP request
    With req
        .Open "POST", url, False
        .send
        Set XHRrequest = JsonConverter.ParseJson(.responseText)
    End With
End Function
For demonstration purposes the code above prints all the data in a worksheet named Name of your Worksheet.
If you need to perform multiple searches you can easily modify the code to best fit your needs. More specifically, you can loop through multiple keywords and call the XHRrequest function with those keywords instead of "test".
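For example, a minimal sketch of such a modification (the keywords list here is hypothetical) could replace the first request like so:
Dim keywords As Variant, keyword As Variant
keywords = Array("test", "sample", "demo") 'hypothetical list of search terms
For Each keyword In keywords
    Set searchResults = XHRrequest(CStr(keyword), 25, 1) 'reuse the function above
    'then page through and write out the results as in the main sub
Next keyword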
You will need to add the following references to your project (VBE>Tools>References):
Microsoft WinHTTP Services version 5.1
Microsoft HTML Objects Library
Microsoft Scripting Runtime
You will also need to add this JSON parser to your project. Follow the installation instructions in the link and you should be set to go.

Unable to "click" Span element in VBA/IE

I'm trying to automate a web scrape of the following URL: https://www.forebet.com/en/football-predictions-from-yesterday
I have code that pulls all of the data, however there is a More+ "button" which expands the list, and I can't seem to get a handle on how to send the click via VBA, as it's not a button element and so I can't use .Click.
The HTML snippet for the section:
<tr id="mrows" style="height:57px;">
    <td colspan="12"><span onclick='ltodrows("1x2","-1")'>More [+]</span>
        <div class="loader"></div>
    </td>
</tr>
I've tried numerous ways of submitting the click, but none have worked!
Start of my code:
Sub Button_More_Test()
    Dim objIE As InternetExplorer
    Dim itemEle As Object
    'Dim e As Object
    Set objIE = New InternetExplorer
    objIE.Visible = True
    objIE.Navigate "https://www.forebet.com/en/football-predictions-from-yesterday"
    Do While objIE.Busy = True Or objIE.ReadyState <> 4: DoEvents: Loop
I see this has already been answered, but I also see that the total number of games is 586.
Now, don't get me wrong, it's perfectly fine to scrape the data using IE and the HTML document, but I can only imagine what a nightmare it must be in terms of efficiency...
That's why I will post a method that circumvents IE, the More+ button and all that.
More specifically, the webpage provides a very convenient way to get the data in JSON format via an HTTP request.
To find out what this request should look like, you have to inspect the network traffic when the More+ button is clicked. You can do that through your browser's developer tools (Ctrl+Shift+E if you're using Firefox).
If you go through the Headers and the Params of the request you will see what the url, the body and the headers should look like. In this particular case, all the parameters are encoded into the url and the headers are not essential to the success of the request, so all you need is the url.
The response's payload is in JSON format. You can inspect its structure using a tool like this.
Basically, the JSON consists of the 586 games, each of which consists of 41 parameters with their corresponding values.
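To give a rough idea of its shape (the parameter names below are purely hypothetical; the code that follows simply iterates whatever keys are present, and missing values arrive as the string "null"):
[
    { "param_1": "...", "param_2": "null", ..., "param_41": "..." },
    ... 586 such objects, one per game ...
]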
TL;DR
Here's how the code could look:
Option Explicit

Sub forebet()
    'declarations
    Dim sht As Worksheet
    Dim req As New WinHttpRequest
    Dim jsonResponse As Object, game As Object
    Dim key As Variant
    Dim url As String
    Dim headers() As String, results() As String
    Dim i As Long, j As Long, lastRow As Long, lastCol As Long
    Dim rng As Range

    Set sht = ThisWorkbook.Worksheets("Name of your Worksheet") 'store your worksheet in a variable
    url = "https://www.forebet.com/scripts/getrs.php?ln=en&tp=1x2&in=-1"

    'HTTP request
    With req
        .Open "GET", url, False
        .send
        Set jsonResponse = JsonConverter.ParseJson(.responseText)
    End With

    ReDim headers(1 To 1, 1 To jsonResponse(1).Count)
    ReDim results(1 To jsonResponse.Count, 1 To jsonResponse(1).Count)

    'write the headers to an array
    i = 1
    For Each key In jsonResponse(1).Keys
        headers(1, i) = key
        i = i + 1
    Next key

    'write the data to an array
    j = 1
    For Each game In jsonResponse
        i = 1
        For Each key In game.Keys
            If game(key) <> "null" Then
                results(j, i) = game(key)
            Else
                results(j, i) = "NULL"
            End If
            i = i + 1
        Next key
        j = j + 1
    Next game

    'write the headers and results arrays to the worksheet in one go
    sht.Range("A1").Resize(1, UBound(headers, 2)) = headers
    sht.Range("A2").Resize(UBound(results, 1), UBound(results, 2)) = results
    'sht.Cells.Value = sht.Cells.Value 'this line causes an out of memory error on 32 bit office

    'prevent numbers from being stored as text
    With sht
        lastRow = .Cells(.Rows.Count, 1).End(xlUp).Row
        lastCol = .Cells(1, .Columns.Count).End(xlToLeft).Column
        Set rng = .Range(.Range("A1"), .Cells(lastRow, lastCol))
    End With
    Debug.Print rng.Address
    rng.Value = rng.Value
End Sub
In fact, as you can see, the website offers a lot more data, which could prove useful. For demonstration purposes the code above prints all the data in a worksheet named Name of your Worksheet.
You will need to add the following references to your project (VBE>Tools>References):
Microsoft WinHTTP Services version 5.1
Microsoft HTML Objects Library
Microsoft Scripting Runtime
You will also need to add this JSON parser to your project. Follow the installation instructions in the link and you should be set to go.
You can loop through elements of the website like so:
For Each ele In objIE.Document.getElementsByTagName("span")
    If ele.innerText = "More [+]" Then
        ele.Click
        Exit For
    End If
Next
You'll have to wait until the new content loads; IE's readyState won't change because the page isn't reloaded, so you'd otherwise grab the body from before the click. Loop until the content changes, for example.
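A minimal sketch of that idea, polling the number of table rows until the click has added new ones (this assumes the extra games arrive as additional tr elements; in practice you would also add a timeout):
Dim rowsBefore As Long
rowsBefore = objIE.Document.getElementsByTagName("tr").Length 'row count before clicking More [+]
'... click the span as shown above ...
Do While objIE.Document.getElementsByTagName("tr").Length = rowsBefore
    DoEvents 'give IE time to append the extra rows
Loop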

Web scraping with excel VBA returns no values for certain sites

I'm trying to scrape data from Oslo Børs related to the 3-month NIBOR using Excel VBA, but my code does not return any values to Excel. The same code is able to retrieve table data from other, similar websites; could it be related to the ui-view setup of the HTML? I'm not very skilled with HTML, so I come up short here. Any help would be much appreciated. Thanks
Sub NIBOR3M_oslobors()
    Dim ie As InternetExplorer
    Dim InternetExplorer As Object
    Dim strURL As String
    Dim ws As Worksheet
    Dim rng As Range
    Dim tbl As Object
    Dim rw As Object
    Dim cl As Object
    Dim nextrow As Long
    Dim I As Long

    strURL = "https://www.oslobors.no/markedsaktivitet/#/details/NIBOR3M.NIBOR/overview"
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .navigate strURL
        Do Until .READYSTATE = 4: DoEvents: Loop
        Do While .Busy: DoEvents: Loop
        Set doc = ie.Document
        Set ws = Worksheets("Interest rates")
        For Each tbl In doc.getElementsByTagName("table")
            nextrow = nextrow + 1
            Set rng = ws.Range("M8:M8" & nextrow)
            For Each rw In tbl.Rows
                For Each cl In rw.Cells
                    rng.Value = rw.innerText
                    Set rng = rng.Offset(, 1)
                    I = I + 1
                Next cl
                nextrow = nextrow + 1
                Set rng = rng.Offset(1, -I)
                I = 0
            Next rw
        Next tbl
        .Quit
    End With
    ActiveSheet.Range("M8").Select
End Sub
The tables you are trying to scrape are generated by scripts, so you will not find them in the HTML document. To get your hands on the data you will need to send an XHR request.
To find out what this request should look like, you have to inspect the network traffic when the page is loaded. You can do that through your browser's developer tools (Ctrl+Shift+E if you're using Firefox).
There you will see several requests of different types being sent. In your case, you should look for XHR requests of type json. There are a few of them; if you go through the responses of these requests, you will eventually find the one you need, as it will contain the data you want.
If you go through the Headers and the Params of the request you will see what the url, the body and the headers should look like. In this particular case, all the parameters are encoded into the url and the headers are not essential to the success of the request, so all you need is the url.
The response's payload is in JSON format. You can inspect its structure using a tool like this.
To parse a response like that you need a parser (see the end of this post).
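Based on the column names requested in the url and the access pattern in the code below (respJSON("rows")(1)("values")), the relevant part of the response looks roughly like this (other fields omitted, values illustrative):
{
    "rows": [
        {
            "values": {
                "ITEM": "...",
                "CLOSENZ": "...",
                "TRADE_TIME": "...",
                "CLOSE_DATE": "...",
                "CHANGENZ": "...",
                "CHANGENZ_PCT": "..."
            }
        }
    ]
}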
TL;DR:
Here's how to get the data from the "Nibor 3 month" table:
Option Explicit

Sub oslobors()
    Dim req As New WinHttpRequest
    Dim respJSON As Object
    Dim key As Variant
    Dim url As String
    Dim results()
    Dim i As Integer
    Dim sht As Worksheet

    Set sht = ThisWorkbook.Worksheets("Name of your worksheet")
    url = "https://www.oslobors.no/ob/servlets/components?type=nibor&source=feed.index.nibor.INDICES_FIXING&view=DELAYED&columns=ITEM%2C+CLOSENZ%2C+TRADE_TIME%2C+CLOSE_DATE%2C+CHANGENZ%2C+CHANGENZ_PCT&filter=ITEM_SECTOR%3D%3DsNIBOR3M.NIBOR&channel=e6da3e88c3464b7ad9620824b8d39c95"

    With req
        .Open "GET", url, False
        .send
        Set respJSON = JsonConverter.ParseJson(.responseText)
    End With

    ReDim results(1 To respJSON("rows")(1)("values").Count, 1 To 2)
    i = 1
    For Each key In respJSON("rows")(1)("values").Keys
        results(i, 1) = key
        results(i, 2) = respJSON("rows")(1)("values")(key) 'write the results in an array
        i = i + 1
    Next key

    sht.Cells(1, "A").Resize(UBound(results, 1), UBound(results, 2)) = results 'print the results array in one go, starting at cell A1 and expanding as much as needed
End Sub
For demonstration purposes, the data was originally printed to the Immediate window; see the edit below for worksheet output.
You will need to add the following references to your project (VBE>Tools>References):
Microsoft WinHTTP Services version 5.1
Microsoft HTML Objects Library
Microsoft Scripting Runtime
You will also need to add this JSON parser to your project. Follow the installation instructions in the link and you should be set to go.
To get the data from the "Historical Prices" table you just have to change the url to this:
url = "https://www.oslobors.no/ob/servlets/components?type=table&source=feed.index.nibor.INDICES_FIXING&view=DELAYED&columns=CHANGE_1WEEK_PCT%2C+HIGH_1WEEK%2C+LOW_1WEEK%2C+CHANGE_1MONTH_PCT%2C+HIGH_1MONTH%2C+LOW_1MONTH%2C+CHANGE_YEAR_PCT%2C+HIGH_YEAR%2C+LOW_YEAR%2C+CHANGE_1YEAR_PCT%2C+HIGH_1YEAR%2C+LOW_1YEAR&filter=ITEM_SECTOR%3D%3DsNIBOR3M.NIBOR&channel=283044a7d182ca196a16337ba79f089c"
EDIT
Modified the code to print the results in a worksheet named "Name of your worksheet".

Is there a way to slow down a Web Scraper so it will pick up the code?

I wrote a macro to go to WU to get historical data and, for the most part, it works. However, I believe that the macro is running too fast for it to pick up the data from the website.
https://www.wunderground.com/history/daily/us/tx/el-paso/KELP/date/2017-1-3
is the website, and the table I want to get is the one with class tablesaw-sortable.
I have tried the following to slow down the process: DoEvents and Application.Wait (Now + TimeValue("00:00:01")).
Sub BrowseToWU()
    Dim IE As New SHDocVw.InternetExplorer
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim RowAddress As Integer
    Dim WebAddress As String
    Dim DateSheet As Date
    Dim WkDay As Integer
    Dim DateSheetName As String

    'Application.ScreenUpdating = False
    'Application.StatusBar = True
    RowAddress = 2
    IE.Visible = True

    Do Until RowAddress = 60
        WebAddress = Range("A" & RowAddress)
        DateSheet = Right(WebAddress, 8)
        DateSheetName = Right(WebAddress, 8)
        WkDay = Weekday(DateSheet, vbSunday)
        If WkDay < 3 Then
            RowAddress = RowAddress + 1
        ElseIf WkDay > 6 Then
            RowAddress = RowAddress + 1
        Else
            IE.Navigate WebAddress
            Do While IE.ReadyState <> READYSTATE_COMPLETE
            Loop
            Set HTMLDoc = IE.Document
            DoEvents
            Application.Wait (Now + TimeValue("00:00:05"))
            DoEvents
            ProcessHTMLPage HTMLDoc
            DateSheet = Right(WebAddress, 8)
            DoEvents
            Application.Wait (Now + TimeValue("00:00:01"))
            ActiveSheet.Name = DateSheetName
            DoEvents
            RowAddress = RowAddress + 1
            'IE.Quit
            Worksheets("Sheet1").Activate
        End If
    Loop
End Sub

Option Explicit

Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)
    Dim HTMLTable As MSHTML.IHTMLElement
    Dim HTMLTables As MSHTML.IHTMLElementCollection
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
    Dim RowNum As Long, ColNum As Integer
    'Dim IE As New SHDocVw.InternetExplorer
    'Dim Ws As Worksheet

    Set HTMLTables = HTMLPage.getElementsByClassName("tablesaw-sortable")
    'DoEvents
    For Each HTMLTable In HTMLTables
        Worksheets.Add
        DoEvents
        Range("A1").Value = HTMLTable.className
        Range("B1").Value = Now
        RowNum = 2
        For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
            'Debug.Print vbTab & HTMLRow.innerText
            ColNum = 1
            For Each HTMLCell In HTMLRow.Children
                Cells(RowNum, ColNum) = HTMLCell.innerText
                ColNum = ColNum + 1
            Next HTMLCell
            RowNum = RowNum + 1
        Next HTMLRow
    Next HTMLTable
    DoEvents
    'IE.Quit
End Sub
The macro is supposed to run through Sheet1, picking up the web address of the historical data if it satisfies the criterion of being a certain day of the week.
IE opens and then control passes to the next module, which takes in the data.
A new worksheet is created and the data is pasted into it.
The worksheet is renamed to the date of the data.
The web address sheet is activated again and the process starts over.
The problem is that the data isn't taken from the website, so the For statement ends, the web address sheet is renamed, and an error occurs.
One way around this is to call the API that the page is using to get that info.
The API returns JSON, which you can parse with a JSON parser. I use jsonconverter.bas. After installing the code from that link in a standard module called JsonConverter, go to VBE > Tools > References and add a reference to Microsoft Scripting Runtime.
Finding the API:
If you press F12 to open the developer tools, go to the Network tab, and then press F5 to refresh any url of interest, you will see the recorded web traffic, and you can find the API call there.
See my answer here on how to search the network traffic using a specific observation value you expect to see in the response; this will filter the list of network traffic to those items containing the value of interest. Be sensible in selecting the value: you want something unlikely to occur elsewhere. You can also filter the network traffic to XHR only.
The API response:
The API returns JSON. More specifically, it returns a dictionary containing two keys. The second key, "observations", can be used to return a collection (denoted by []) of dictionaries (denoted by {}).
Each dictionary represents a row of the table (daily observations). You can loop this collection, and then loop the inner dictionaries, to access the table row values and reconstruct the table by populating an array. Explore an example JSON response here.
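A trimmed sketch of that shape (only "observations" is named above; the first key's name is omitted here, and the field values are illustrative):
{
    "...": { ... },
    "observations": [
        { "valid_time_gmt": 1483425000, "expire_time_gmt": 1483426800, ... },
        ... one dictionary per table row ...
    ]
}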
Explanation of code:
The code is broken down into a number of helper subs and functions, allocating certain tasks to each, to make the code easier to debug and follow, as well as to better align with object-oriented programming principles.
Overall the process is:
1. Gather the urls from Worksheet("Sheet1"). Helper function GetAllUrls.
2. Process those urls and retain only the dates which correspond with Tue-Thu. These are kept as strings formatted as "yyyymmdd" so they can be passed to the API later. This is handled by the helper functions GetOnlyQualifyingUrlsDates and IncludeThisDate. IncludeThisDate performs the check on whether to include a date; GetOnlyQualifyingUrlsDates handles the looping and the formatting of the results.
3. Issue xmlhttp requests by looping over the qualifying url dates, concatenating each into the url for the API call, and issuing the request. This is performed by the main sub GetTables.
4. Sheet creation, for output, is handled by the helper function CreateWorksheet. This function calls another helper function, SheetExists, to ensure sheets are only created if they don't already exist; otherwise, the existing sheet by that name is used.
5. The resultant json response, from step 3, is passed to a helper sub WriteOutResults, which accepts the json variable and the output sheet object as arguments. It extracts all the info from the json response, essentially reconstructing the table, and adds the table and headers to the appropriate sheet. It also calls the helper function Epoch2Date, which handles the unix-timestamp-to-datetime conversion for the two unix fields in the json object.
TODO:
The API key may be time limited. Add a helper function which returns the current valid key.
The API accepts start date and end date parameters in the url. It would be far better to issue one request for the entire range if possible, or for chunked ranges e.g. months, to reduce the number of requests made. This would also reduce the likelihood of being blocked. It would mean some additional code would need to be written, before writing out the results, to ensure only dates of interest are written to sheets; though you could write everything out and then simply loop all the sheets and delete those that aren't wanted (perfectly doable if we are talking about 365 dates in total). Personally, I would handle the include-date part while constructing the table from a single request (if possible) that passes the min and max dates of the entire url list as the start and end date parameters; a rough sketch of that idea follows below. I would then write a single flat table out to one sheet, as this will be much easier for later data analysis.
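A rough sketch of the single-request idea (same endpoint and key as the code below; this assumes the url list, and therefore qualifyingUrlsDates, is in ascending date order, and the filtering step is left as an outline):
Dim minDate As String, maxDate As String
minDate = qualifyingUrlsDates(LBound(qualifyingUrlsDates)) 'earliest qualifying date
maxDate = qualifyingUrlsDates(UBound(qualifyingUrlsDates)) 'latest qualifying date
url = "https://api.weather.com/v1/geocode/31.76/-106.49/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=" & minDate & "&endDate=" & maxDate & "&units=e"
'one request now returns observations for the whole range; after parsing,
'keep only the observations whose valid_time_gmt falls on a qualifying date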
VBA:
Option Explicit

Public Sub GetTables()
    'VBE > Tools > References > Microsoft Scripting Runtime
    Dim json As Object, qualifyingUrlsDates(), urls(), url As String
    Dim ws As Worksheet, wsOutput As Worksheet, i As Long, startDate As String, endDate As String

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    urls = GetAllUrls(2, ws, "A")
    qualifyingUrlsDates = GetOnlyQualifyingUrlsDates(urls)

    'API key may not be valid over time so look at obtaining it with a prior request
    'issue an xmlhttp request for each valid date (this would be better done using startDate and endDate
    'to specify batches e.g. months within the total range, to cut down on requests)
    With CreateObject("MSXML2.XMLHTTP")
        For i = LBound(qualifyingUrlsDates) To UBound(qualifyingUrlsDates)
            startDate = qualifyingUrlsDates(i)
            endDate = startDate 'a little verbose but useful for explaining
            url = "https://api.weather.com/v1/geocode/31.76/-106.49/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=" & startDate & "&endDate=" & endDate & "&units=e"
            .Open "GET", url, False
            .send
            Set json = JsonConverter.ParseJson(.responseText)("observations")
            Set wsOutput = CreateWorksheet(qualifyingUrlsDates(i))
            WriteOutResults wsOutput, json
        Next
    End With
End Sub

Public Sub WriteOutResults(ByVal wsOutput As Worksheet, ByVal json As Object)
    'json is a collection of dictionaries. Each dictionary is a time period reading from the day i.e. one row in output
    Dim results(), item As Object, headers(), r As Long, c As Long, key As Variant
    headers = json.item(1).keys 'get the headers which are the keys of each dictionary
    ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
    For Each item In json
        r = r + 1: c = 0 'move to the next row in results array to store results for this table row
        For Each key In item.keys
            c = c + 1 'increase column number in results array for writing out results
            Select Case key
            Case "valid_time_gmt", "expire_time_gmt" 'convert unix timestamp fields to datetime
                results(r, c) = Epoch2Date(item(key))
            Case Else
                results(r, c) = item(key)
            End Select
        Next
    Next
    With wsOutput
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetOnlyQualifyingUrlsDates(ByVal urls As Variant) As Variant
    Dim i As Long, output(), counter As Long
    ReDim output(1 To UBound(urls))
    For i = LBound(urls) To UBound(urls)
        If IncludeThisDate(urls(i)) Then 'check if weekday is to be included
            counter = counter + 1
            output(counter) = Format$(Right$(urls(i), 8), "yyyymmdd") 'if included then add to output array of dates of interest
        End If
    Next
    ReDim Preserve output(1 To counter)
    GetOnlyQualifyingUrlsDates = output
End Function

Public Function IncludeThisDate(ByVal url As String) As Boolean
    'tue, wed, thurs are valid
    IncludeThisDate = Not IsError(Application.Match(Weekday(Right$(url, 8), vbSunday), Array(3, 4, 5)))
End Function

Public Function SheetExists(ByVal sheetName As String) As Boolean '<== function by @Rory
    SheetExists = Evaluate("ISREF('" & sheetName & "'!A1)")
End Function

Public Function GetAllUrls(ByVal startRow As Long, ByVal ws As Worksheet, ByVal columnName As String) As Variant
    'transpose used on the premise of no more than a couple of years of dates
    'startRow is the first row of urls, ws is the sheet where the urls are found, columnName is the string representation of the url column e.g. "A"
    With ws
        GetAllUrls = Application.Transpose(.Range(columnName & startRow & ":" & columnName & .Cells(.Rows.Count, columnName).End(xlUp).Row).Value)
    End With
End Function

Public Function CreateWorksheet(ByVal sheetName As String) As Worksheet
    Dim ws As Worksheet
    If SheetExists(sheetName) Then
        Set ws = ThisWorkbook.Worksheets(sheetName)
        'do something.... clear it? Then add new data to it?
    Else
        Set ws = ThisWorkbook.Worksheets.Add
        ws.Name = sheetName
    End If
    Set CreateWorksheet = ws
End Function

Public Function Epoch2Date(ByVal E As Currency, Optional msFrac) As Date '# Schmidt http://www.vbforums.com/showthread.php?805245-EPOCH-to-Date-and-vice-versa
    Const Estart As Double = #1/1/1970#
    msFrac = 0
    If E > 10000000000# Then E = E * 0.001: msFrac = E - Int(E)
    Epoch2Date = Estart + (E - msFrac) / 86400
End Function

VBA Code Scraper not placing data in right columns

The code works fine, but I need it to extract ONLY emails and URLs and place them on Sheet1 "Scraper" in the next blank row:
Emails = Column A
Urls = Column B
Currently it extracts any text, email or URL, and places it in column A or B.
I only need emails and URLs. I have been stuck on this for some time and can't seem to work it out.
Also I am not sure if my delete-duplicates step is deleting duplicate rows or duplicates within a column. It SHOULD be duplicate rows.
How the code works:
On Sheet2 "URL List" I have a list of URLs; the code runs through this list, places the results onto Sheet1 "Scraper", and deletes any duplicates.
It is only supposed to scrape emails and URLs and place them in columns A and B on the next blank row.
I have tried to fix the problem but it is out of my scope.
Private Sub fbStart_Click()
    'set Sheet2 URL List and open Internet Explorer
    Dim lr As Long
    Dim x As Long
    Dim arr() As Variant
    Dim wks As Worksheet
    Dim ie As Object
    Dim dd(1 To 2) As String
    Dim Fr As Long

    On Error Resume Next
    Application.ScreenUpdating = False
    Set wks = ThisWorkbook.Sheets("Url List")
    With wks
        Fr = .Cells(.Rows.Count, 6).End(xlUp).Offset(1).Row
        lr = .Cells(.Rows.Count, 1).End(xlUp).Row
        .Cells(1, 5).Value = lr
        arr = .Range(.Cells(Fr, 1), .Cells(lr, 1)).Value
    End With

    'show Internet Explorer and add a delay in seconds if needed
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = True
        Application.Wait Now + TimeValue("0:00:0")
        For x = LBound(arr, 1) To UBound(arr, 1)
            .navigate arr(x, 1)
            wtime = Time
            Do While .Busy Or .readyState <> 4
                DoEvents
                'skip pages with captchas + write the word Captcha in Sheet2 column C
                If Time > (wtime + TimeValue("00:00:10")) Then
                    Cells(x + 1, "C").Value = "Captcha"
                    Exit Do
                End If
            Loop
            On Error Resume Next
            'variables for the data to be extracted from the webpage; change the innertext index if the same class is used
            Dim doc As HTMLDocument
            Set doc = ie.document
            dd(1) = doc.getElementsByClassName("_50f4")(2).innerText
            dd(2) = doc.getElementsByClassName("_50f4")(3).innerText
            'paste the web data into Sheet1 "Scraper" in the next blank row
            With Sheet1
                .Cells(.Rows.Count, "A").End(xlUp).Offset(1, 0).Resize(, 2).Value = dd
            End With
            'put a 1 in Sheet2 "Url List" column B to flag this URL as done
            Sheets("Url List").Range("B" & Rows.Count).End(xlUp).Offset(1, 0).Value = 1
            'deletes duplicates in columns A and B of Sheet1
            Columns(1).RemoveDuplicates Columns:=Array(1)
            Columns(2).RemoveDuplicates Columns:=Array(1)
            'count the 1s in Sheet2 column B
            With Worksheets("Url List")
                Lastrow = .Cells(.Rows.Count, "B").End(xlUp).Row
                Sheets("Url List").Range("B1").Value = Lastrow
            End With
            Call Autoclick_Click
        Next x
        .Quit
    End With

    'hide Facebook Scraper form
    ScraperForm.Hide
End Sub
Below I show how to handle finding the email and website address. You already have your loop and de-duplication step. Below are helper methods for extracting the required info; you can simply assign from the variables email and website to your cells in the loop. I show a method using a helper function to determine the last row in the target sheet and write the variables out to the correct columns in one go.
I can help with implementing the loop integration if needed, but the emphasis here is on explaining what could be done to identify the elements of interest and how to write out to the correct columns. Tbh, de-duplicating is easily done in the sheet at the end, but you can also use the macro recorder to get perfectly functional code for that single step, or use existing SO answers.
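For completeness, a one-off sketch of that de-duplication step, removing duplicate rows (i.e. rows where both the email and the url match, rather than de-duplicating each column independently; assumes a header row):
With ThisWorkbook.Worksheets("Scraper")
    .Range("A1").CurrentRegion.RemoveDuplicates Columns:=Array(1, 2), Header:=xlYes
End With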
tl;dr:
This would be a lot easier if the :contains / :has css pseudo-classes were permitted. Instead, my approach is as follows:
email - find the href attribute whose value starts with mailto
website - check that there is a website icon on the page
Specify a parent of both the website icon and the website address
Loop all matches to that parent specification, checking whether each contains the website icon (this is where pseudo-class selectors would have simplified things). If a match is found, then we have the shared parent of both the icon and, hopefully, the website address; use childOfSiblingCssSelector (we are looking at a child of the following div in this case) to extract the website url.
Notes:
The entire thing is kept quite high level/generic so that you can adjust your css selectors to hopefully cater for different scenarios. Consequence: it may seem a little verbose.
Helper functions are provided to handle element matching. Name these in a way that makes sense for what they are doing. I think there is some room for improvement here.
Whilst, technically, the second helper, GetText, could be used to extract the email address as well as the website address (I'd probably add another argument to the function call to specify the attribute to extract; see the sketch after these notes), it seems far quicker, currently, to simply target the appropriate href as detailed above.
I have kept the css selectors as local variables close to their usage; you could have them as constants, closer to the top of the module, where they are perhaps easier to access. I'm unsure without seeing how this performs over time/different urls.
Css selectors are chosen over the .getElementsBy methods for two reasons: 1) there is browser optimization for css selectors so, if well formulated, css will be faster; 2) I want to preserve the flexibility of the code/helper functions - you have far more specificity with css selectors in terms of what patterns you can express. I deemed this important as I don't know what future cases you may need to handle.
I am deliberately not using class name and index, e.g. doc.getElementsByClassName("_50f4")(2).innerText, as I am unfamiliar with the range of potential use cases; this feels a little fragile as it relies on consistent ordering and numbering of elements (at least up to these indices).
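As a sketch of that variation (GetAttributeText and its attributeName argument are hypothetical additions, not part of the code below):
Public Function GetAttributeText(ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String, ByVal attributeName As String) As String
    'same looping idea as GetText below, but reads an attribute (e.g. "href") instead of innerText
    Dim i As Long, html As HTMLDocument
    Set html = New HTMLDocument
    For i = 0 To parents.Length - 1
        html.body.innerHTML = parents.item(i).innerHTML
        If html.querySelectorAll(iconCssSelector).Length > 0 Then
            GetAttributeText = html.querySelector(childOfSiblingCssSelector).getAttribute(attributeName)
            Exit Function
        End If
    Next
    GetAttributeText = "Not found"
End Function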
TODO:
Rather than instantiating a new HTMLDocument each time in GetText, it would be more efficient to pass another HTMLDocument argument in the function signature, i.e. from the calling procedure. A re-factor could take that into consideration.
This type of coding might lend itself to being class based in the future. Particularly if error handling is to be added and further functions.
VBA:
Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
    Dim ie As Object, ws As Worksheet
    Set ie = CreateObject("InternetExplorer.Application")
    Set ws = ThisWorkbook.Worksheets("Scraper")
    With ie
        .Visible = True
        .Navigate2 "https://www.facebook.com/pg/SalemFordNH/about/?ref=page_internal%5Blink%5D"
        While .Busy Or .readyState < 4: DoEvents: Wend
        With .document
            Dim email As String, website As String, iconCssSelector As String
            'iconCssSelector for website icon in this instance
            iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"
            If ElementIsPresent(ie.document, "[href^=mailto]") Then
                email = ie.document.querySelector("[href^=mailto]").innerText
            Else
                email = "Not found"
            End If
            Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
            sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
            childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent
            If ElementIsPresent(ie.document, iconCssSelector) _
                And ElementIsPresent(ie.document, sharedParentCssSelector) Then
                Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
                website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
            Else
                website = "Not found"
            End If
        End With
        'assumes headers are already present
        Dim nextRow As Long
        nextRow = GetLastRow(ws, 1) + 1
        ws.Cells(nextRow, 1).Resize(1, 2) = Array(email, website)
        .Quit
    End With
End Sub

Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
    ElementIsPresent = document.querySelectorAll(cssSelector).Length > 0
End Function

Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
    'in this instance, and with the Microsoft IE DOM, you cannot select for the parent of an element with a pseudo-class
    'of :has(>child), nor use :contains... instead, pass the expected parent selector, which houses
    'both the icon element for the website and the website address itself, and loop all matches checking for the website icon;
    'if found, use childOfSiblingCssSelector to extract
    Dim i As Long, html As HTMLDocument
    Set html = New HTMLDocument
    For i = 0 To parents.Length - 1
        html.body.innerHTML = parents.item(i).innerHTML
        If ElementIsPresent(html, iconCssSelector) Then
            GetText = html.querySelector(childOfSiblingCssSelector).innerText
            Exit Function
        End If
    Next
    GetText = "Not found"
End Function

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
    End With
End Function
Project references (VBE > Tools > References):
Microsoft HTML Object Library
Additional reading:
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll
https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector
Edit:
Example of the loop - assumes no empty rows in column A between urls.
Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
    Dim ie As Object, ws As Worksheet, wsUrls As Worksheet, urls()
    Set ie = CreateObject("InternetExplorer.Application")
    Set ws = ThisWorkbook.Worksheets("Scraper")
    Set wsUrls = ThisWorkbook.Worksheets("Url List")
    With wsUrls
        urls = Application.Transpose(.Range("A2:A" & .Cells(.Rows.Count, "A").End(xlUp).Row).Value)
    End With
    Dim results(), r As Long
    ReDim results(1 To UBound(urls), 1 To 2)
    With ie
        .Visible = True
        For r = LBound(urls) To UBound(urls)
            .Navigate2 urls(r)
            While .Busy Or .readyState < 4: DoEvents: Wend
            With .document
                Dim email As String, website As String, iconCssSelector As String
                'iconCssSelector for website icon in this instance
                iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"
                If ElementIsPresent(ie.document, "[href^=mailto]") Then
                    email = ie.document.querySelector("[href^=mailto]").innerText
                Else
                    email = "Not found"
                End If
                Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
                sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
                childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent
                If ElementIsPresent(ie.document, iconCssSelector) _
                    And ElementIsPresent(ie.document, sharedParentCssSelector) Then
                    Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
                    website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
                Else
                    website = "Not found"
                End If
            End With
            results(r, 1) = email
            results(r, 2) = website
        Next
        .Quit
    End With
    'assumes headers are already present
    Dim nextRow As Long
    nextRow = GetLastRow(ws, 1) + 1
    ws.Cells(nextRow, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub

Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
    ElementIsPresent = document.querySelectorAll(cssSelector).Length > 0
End Function

Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
    'in this instance, and with the Microsoft IE DOM, you cannot select for the parent of an element with a pseudo-class
    'of :has(>child), nor use :contains... instead, pass the expected parent selector, which houses
    'both the icon element for the website and the website address itself, and loop all matches checking for the website icon;
    'if found, use childOfSiblingCssSelector to extract
    Dim i As Long, html As HTMLDocument
    Set html = New HTMLDocument
    For i = 0 To parents.Length - 1
        html.body.innerHTML = parents.item(i).innerHTML
        If ElementIsPresent(html, iconCssSelector) Then
            GetText = html.querySelector(childOfSiblingCssSelector).innerText
            Exit Function
        End If
    Next
    GetText = "Not found"
End Function

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
    End With
End Function
