Pulling specific table cells from Morningstar, then looping to next Morningstar page - excel

I am currently trying to scrape certain pieces of data from a table on Morningstar, then have it loop to the next ticker and repeat until there are no more tickers.
Currently, it will pull the entire "rank in category" row on the Trailing Total Returns table. I'm simply trying to pull the 3 month, 6 month, YTD, 1 year, 3 year, and 5 year. When it's done pulling those, it will loop to the next ticker as determined by the "Cells(p, 14)" in the navigate line.
For example, it detects "LINKX" in cell 1, 14, so it navigates to http://performance.morningstar.com/fund/performance-return.action?t=LINKX&region=usa&culture=en_US and pulls all of the "Rank in Category" values from the "Trailing Total Returns" table. I only want the specified ones put into specified cell locations, before it loops to the next ticker.
I've browsed through many similar threads. Using Excel VBA, I am trying to pull specific key info from a ticker's page, then loop to the next ticker and repeat.
Declare PtrSafe Function apiShowWindow Lib "user32" Alias "ShowWindow" _
(ByVal hwnd As LongPtr, ByVal nCmdShow As LongPtr) As LongPtr
Global Const SW_MAXIMIZE = 3
Global Const SW_SHOWNORMAL = 1
Global Const SW_SHOWMINIMIZED = 2
Sub LinkedInWebScrapeScript()
Dim objIE As InternetExplorer
Dim html As HTMLDocument
Set objIE = GetObject("new:{D5E8041D-920F-45e9-B8FB-B1DEB82C6E5E}")
objIE.Visible = 1
Dim p As Integer
p = 3
objIE.navigate ("http://performance.morningstar.com/fund/performance-return.action?t=" & Cells(p, 14) & "&region=usa&culture=en_US")
Application.Wait Now + #12:00:02 AM#
While objIE.Busy
DoEvents
Wend
apiShowWindow objIE.hwnd, SW_MAXIMIZE
For i = 1 To 2
objIE.document.parentWindow.scrollBy 0, 100000 & i
Application.Wait Now + #12:00:01 AM#
Next i
Dim TDelements As IHTMLElementCollection
Dim htmldoc As MSHTML.IHTMLDocument 'Document object
Dim eleColtr As MSHTML.IHTMLElementCollection 'Element collection for tr tags
Dim eleColtd As MSHTML.IHTMLElementCollection 'Element collection for td tags
Dim eleColtd1 As MSHTML.IHTMLElementCollection
Dim eleRow As MSHTML.IHTMLElement 'Row elements
Dim eleCol As MSHTML.IHTMLElement 'Column elements
Set htmldoc = objIE.document 'Document webpage
Set eleColtr = htmldoc.getElementsByTagName("tr") 'Find all tr tags
Set TDelements = htmldoc.getElementsByTagName("table")
'This section populates Excel
i = 0 'start with first value in tr collection
Set eleColtd = htmldoc.getElementsByClassName("r_table3 width955px print97")(0).getElementsByClassName("last")(0).getElementsByClassName("row_data divide") 'get all the td elements in that specific tr
For Each eleCol In eleColtd 'for each element in the td collection
Sheets("Sheet2").Range("A1").Offset(i, j).Value = eleCol.innerText 'paste the inner text of the td element, and offset at the same time
j = j + 1 'move to next element in td collection
Next eleCol 'rinse and repeat
i = i + 1
p = p + 1
objIE.navigate ("http://performance.morningstar.com/fund/performance-return.action?t=" & Cells(p, 14) & "&region=usa&culture=en_US")
Set eleColtd = htmldoc.getElementsByClassName("r_table3 width955px print97")(0).getElementsByClassName("last")(0).getElementsByClassName("row_data divide") 'get all the td elements in that specific tr
For Each eleCol In eleColtd 'for each element in the td collection
Sheets("Sheet2").Range("A1").Offset(i, j).Value = eleCol.innerText 'paste the inner text of the td element, and offset at the same time
z = z + 1
j = j + 1 'move to next element in td collection
Next eleCol 'rinse and repeat
End Sub

The following shows a loop and how to select the appropriate table, tbody and then table cells using CSS selectors. Tickers are read into an array from column N, starting at row 1. It assumes there are no blank cells within the range (though you could add a test to be sure).
There is a loop over the array, which contains each ticker, and the TICKER placeholder in the url is replaced with the current ticker value.
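The placeholder replacement, together with the blank-cell test suggested above, could be sketched as follows. BuildTickerUrl is a hypothetical helper name, not part of the answer's code; it returns an empty string for blank or error cells so the calling loop can skip them.

```vba
Option Explicit

' Hypothetical helper (not in the original answer): builds the page url for
' one ticker by swapping the TICKER placeholder, as the answer's loop does.
' Returns vbNullString for blank or error cell values so the caller can skip them.
Public Function BuildTickerUrl(ByVal ticker As Variant) As String
    Const URL_TEMPLATE As String = _
        "http://performance.morningstar.com/fund/performance-return.action?t=TICKER&region=usa&culture=en_US"
    Dim cleaned As String

    If IsError(ticker) Then Exit Function          ' e.g. #N/A in the sheet
    cleaned = Trim$(CStr(ticker))
    If Len(cleaned) = 0 Then Exit Function         ' blank cell inside the range

    BuildTickerUrl = Replace$(URL_TEMPLATE, "TICKER", cleaned)
End Function
```

Inside the loop you might then write url = BuildTickerUrl(tickers(i)) and only navigate when Len(url) > 0.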
There is a line to click on the monthly display tab.
The appropriate row is identified via
Set rankings = .querySelectorAll("#tab-month-end-content .last td")
#tab-month-end-content is an id selector which gets the right tab, then .last is a class selector matching the tbody whose class name is last, then td selects the td cells within that tbody.
CSS selectors:
Modern browsers are optimized for CSS. CSS selectors are a fast way to match on elements in an html document. They are applied via the querySelector or querySelectorAll methods; in this case, of HTMLDocument (ie.document). querySelector returns a single node: the first match for the CSS selector. querySelectorAll returns a nodeList of all matched items; you then index into that nodeList to get specific items e.g. the second td cell is at index 1.
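As a rough illustration of indexing into the nodeList, the sketch below loads an HTML fragment into an HTMLDocument (the same approach the xmlhttp version further down uses) and returns the text of every match. CellTextsFor is a hypothetical helper name, and any markup you pass in for testing is your own, not Morningstar's real page.

```vba
Option Explicit
' Requires VBE > Tools > References > Microsoft HTML Object Library

' Hypothetical helper: returns a zero-based array holding the innerText of
' every node matched by the css selector within the supplied markup.
Public Function CellTextsFor(ByVal markup As String, ByVal css As String) As Variant
    Dim html As MSHTML.HTMLDocument, nodes As Object
    Dim i As Long, texts() As String

    Set html = New MSHTML.HTMLDocument
    html.body.innerHTML = markup

    Set nodes = html.querySelectorAll(css)         ' nodeList of all matches
    If nodes.Length = 0 Then
        CellTextsFor = Array()                     ' empty array when nothing matches
        Exit Function
    End If

    ReDim texts(0 To nodes.Length - 1)
    For i = 0 To nodes.Length - 1                  ' nodeList is 0 index based
        texts(i) = nodes.item(i).innerText
    Next
    CellTextsFor = texts
End Function
```

With a fragment shaped like the page, CellTextsFor(markup, "#tab-month-end-content .last td")(1) would give the text of the second td cell.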
Looking at the pattern we specified:
#tab-month-end-content .last td
The first part is an id selector, #, which selects an element by id
#tab-month-end-content
When applied to the page this returns two matches and we want the second
The next part
.last
is a class selector, ., for class name last. This selects the tbody tag child element shown in the image above. As only the second id matched element has this child we are now working with the right parent element to go on and select the td type elements using type selector
td
The whitespace between each part described above is known as a descendant combinator. It specifies that elements matched by the selector on the right are selected only if they have an ancestor element matching the selector on the left (the ancestor does not have to be the direct parent).
We can see this with the next image:
VBA:
Option Explicit
Public Sub GetData()
Dim ie As Object, tickers(), ws As Worksheet, lastRow As Long
Dim results(), headers(), r As Long, i As Long, url As String
headers = Array("ticker", "3m", "6m", "ytd", "1y", "3y", "5y")
Set ws = ThisWorkbook.Worksheets("Sheet1")
tickers = Application.Transpose(ws.Range("N1:N" & GetLastRow(ws, 14)).Value)
ReDim results(1 To UBound(tickers), 1 To UBound(headers) + 1)
Set ie = GetObject("new:{D5E8041D-920F-45e9-B8FB-B1DEB82C6E5E}")
With ie
.Visible = True
For i = LBound(tickers) To UBound(tickers)
r = r + 1
url = Replace$("http://performance.morningstar.com/fund/performance-return.action?t=TICKER&region=usa&culture=en_US", "TICKER", tickers(i))
.Navigate2 url
While .Busy Or .readyState < 4: DoEvents: Wend
.document.querySelector("[tabname='#tabmonth']").Click
Dim rankings As Object
Do
Loop While .document.querySelectorAll("#tab-month-end-content .last td").Length = 0 'could add timed loop here
With .document
Set rankings = .querySelectorAll("#tab-month-end-content .last td")
On Error Resume Next
results(r, 1) = tickers(i)
results(r, 2) = rankings.item(1).innerText
results(r, 3) = rankings.item(2).innerText
results(r, 4) = rankings.item(3).innerText
results(r, 5) = rankings.item(4).innerText
results(r, 6) = rankings.item(5).innerText
results(r, 7) = rankings.item(6).innerText
On Error GoTo 0
End With
Set rankings = Nothing
Next
ws.Cells(1, 15).Resize(UBound(results, 1), UBound(results, 2)) = results
.Quit
End With
End Sub
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function
As mentioned by @SIM, you could use xmlhttp and avoid the browser, though I am not sure whether, with your security settings, you would need to whitelist sites. You will need to explore whether the placeholder is valid in the url here: XNAS:TICKER. The XNAS prefix may vary across your tickers, in which case you would need the appropriate string, including the prefix, in column N and then replace the extended placeholder with that e.g. .....=PLACEHOLDER&region.......
Option Explicit
Public Sub GetData()
Dim tickers(), ws As Worksheet, lastRow As Long
Dim results(), headers(), r As Long, i As Long, url As String, html As HTMLDocument
Set html = New HTMLDocument 'vbe > tools > references > Microsoft HTML Object Library
headers = Array("ticker", "3m", "6m", "ytd", "1y", "3y", "5y")
Set ws = ThisWorkbook.Worksheets("Sheet1")
tickers = Application.Transpose(ws.Range("N1:N" & GetLastRow(ws, 14)).Value)
ReDim results(1 To UBound(tickers), 1 To UBound(headers) + 1)
With CreateObject("MSXML2.XMLHTTP")
For i = LBound(tickers) To UBound(tickers)
r = r + 1
url = Replace$("http://performance.morningstar.com/perform/Performance/fund/trailing-total-returns.action?&t=XNAS:TICKER&region=usa&culture=en-US&cur=&ops=clear&s=0P0000J533&ndec=2&ep=true&align=m&annlz=true&comparisonRemove=false&loccat=&taxadj=&benchmarkSecId=&benchmarktype=", "TICKER", tickers(i))
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.setRequestHeader "DNT", "1"
.send
html.body.innerHTML = .responseText
Dim rankings As Object
With html
Set rankings = .querySelectorAll(".last td")
On Error Resume Next
results(r, 1) = tickers(i)
results(r, 2) = rankings.item(1).innerText
results(r, 3) = rankings.item(2).innerText
results(r, 4) = rankings.item(3).innerText
results(r, 5) = rankings.item(4).innerText
results(r, 6) = rankings.item(5).innerText
results(r, 7) = rankings.item(6).innerText
On Error GoTo 0
End With
Set rankings = Nothing
Next
ws.Cells(1, 15).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
With ws
GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
End With
End Function

Related

Excel 2016 VBA web scraping using getElementsByClassName

My system is working with Windows 10 64bit and Office 2016
I am a bit familiar with Excel VBA, and I am trying to learn web scraping using VBA. Unfortunately, there is little information on in-depth digging on web data.
The website I want to scrape data from is bizbuysell.com from the seller's offers such as
Sample URL 1
Sample URL 2
There is a section that starts with the headline Detailed Information
The HTML code is: Detailed Information
<h3>Detailed Information</h3>
Location:
Pinellas County, FL
Inventory:
Included in asking price
Employees:
8 FT
I want to scrape data from this section.
The problem is that there are some 18 possible data labels with their respective values, but only those for which the seller has entered data are shown.
My idea was to search for all possible data labels and, if a label is not available, move on to the next data field.
I tried it with the following code, but obviously I made a mistake:
For Each ele In doc.getElementsByClassName("listingProfile_details")
txt = ele.parentElement.innerText
If Left(txt, 8) = "Location" Then
location = Trim(Mid(txt, InStrRev(txt, ":") + 1))
ElseIf Left(txt, 4) = "Inventory" Then
inventory = Trim(Mid(txt, InStrRev(txt, ":") + 1))
.
.
.
End If
Next ele
I hope that someone can show me the correct VBA code to check for all 18 possible data labels and the respective data
Thank you so much!
Tony
One way is to gather a nodeList of the dt/dd elements and loop over it with a step of 2, so you can access the label at index n and the value at index n + 1.
To handle differing numbers of labels being present, you can initialise a fresh dictionary, with all the possible labels as keys, and the associated values as vbNullString, during the loop over urls, such that for each new XHR request you get a new dictionary ready to populate with the labels that are found. By using .Exists test, you only update the values for keys (labels) that are found at the current URI.
You can store all results in an array to write out to the sheet in one go at end.
There are lots of additional notes within the code.
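Stripped of the web request, the dictionary idea can be sketched on plain arrays. This is only an illustration: FillKnownLabels is a hypothetical name, the flat label/value array stands in for the nodeList, and late binding is used here so the sketch runs without the Scripting Runtime reference.

```vba
Option Explicit

' Hypothetical helper: starts from a dictionary holding every known label with
' a blank value, then walks a flat label, value, label, value ... list in steps
' of 2 and fills in only the labels that are already keys.
Public Function FillKnownLabels(ByVal knownLabels As Variant, ByVal flatPairs As Variant) As Object
    Dim dict As Object, i As Long

    Set dict = CreateObject("Scripting.Dictionary")
    For i = LBound(knownLabels) To UBound(knownLabels)
        dict(knownLabels(i)) = vbNullString        ' blank entry per possible label
    Next

    For i = LBound(flatPairs) To UBound(flatPairs) - 1 Step 2
        ' label at i, value at i + 1; unknown labels are ignored
        If dict.Exists(flatPairs(i)) Then dict(flatPairs(i)) = flatPairs(i + 1)
    Next

    Set FillKnownLabels = dict
End Function
```

Because every call starts from a fresh dictionary with the same keys in the same order, each url yields the same columns, with blanks where a label was absent.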
Option Explicit
Public Sub GetDetailedBizBuySellInfo()
Dim http As Object, urls() As Variant
Dim html As MSHTML.HTMLDocument 'VBE > Tools > References > Microsoft HTML Object Library
urls = Array("https://www.bizbuysell.com/Business-Opportunity/covid-friendly-commercial-cleaning-est-30-years-100k-net/1753433/?d=L2Zsb3JpZGEvaGlsbHNib3JvdWdoLWNvdW50eS1idXNpbmVzc2VzLWZvci1zYWxlLzI/cT1hVEk5T0RFc01qQXNNekFzTnpnbWJtRndQV1UlM0Q=", _
"https://www.bizbuysell.com/Business-Opportunity/Established-Cleaning-Business-Tampa-St-Pete/1849521/?utm_source=bizbuysell&utm_medium=emailsite&utm_campaign=shtmlbot&utm_content=headline")
Set http = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
Dim url As Long, results() As Variant
ReDim results(1 To UBound(urls) + 1, 1 To 19) 'size the final output array. _
There will be the number of urls as row count, the number of labels as column count + 1 to store the url itself. You need to update the list of labels below. See GetBlankDetailedInformationDictionary
With http
For url = LBound(urls) To UBound(urls) 'loop url list
.Open "Get", urls(url), False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
Dim currentDetailedInformation As Scripting.Dictionary 'VBE > Tools > References > Microsoft Scripting Runtime
Set currentDetailedInformation = GetCurrentDetailedInfo(html) 'use retrieved html to return a dictionary with key as dt > strong e.g.Location; value as dd e.g. Tampa, FL
AddCurrentDetailedInfoToResults results, currentDetailedInformation, url, urls(url) 'url + 1 (zero indexed) will keep track of current row number to add to results
Next
End With
With ActiveSheet 'better to update with explicit sheet/be careful not to overwrite data already in a sheet
.Cells(1, 1).Resize(1, UBound(results, 2)) = currentDetailedInformation.keys ' write out headers
.Cells(1, UBound(results, 2)) = "Url"
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results ' write out results
End With
End Sub
Public Sub AddCurrentDetailedInfoToResults(ByRef results As Variant, ByVal currentDetailedInformation As Scripting.Dictionary, ByVal url As Long, ByVal currentUrl As String)
Dim key As Variant, currentColumn As Long
For Each key In currentDetailedInformation.keys
currentColumn = currentColumn + 1 'increase column count to update results array with
results(url + 1, currentColumn) = currentDetailedInformation(key)
Next
results(url + 1, currentColumn + 1) = currentUrl
End Sub
Public Function GetCurrentDetailedInfo(ByVal html As MSHTML.HTMLDocument) As Scripting.Dictionary
' Gathers a list of all the relevant dd, dt nodes within the passed in HTMLDocument.
' Requests a new blank dictionary whose keys are the labels (child strong element of dt tag)
'Updates blank dictionary, per key, where present, with dd value in a loop of step 2 as list is strong, dd, strong, dd etc.....
Dim updatedDictionary As Scripting.Dictionary, listOfLabelsAndValues As MSHTML.IHTMLDOMChildrenCollection
Set updatedDictionary = GetBlankDetailedInformationDictionary
'Css pattern to match the appropriate nodes
Set listOfLabelsAndValues = html.querySelectorAll("#ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_listingDetails_dlDetailedInformation dt > strong, #ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_listingDetails_dlDetailedInformation dd")
Dim currentIndex As Long
For currentIndex = 0 To listOfLabelsAndValues.length - 2 Step 2 'nodeList is 0 index based
'On Error Resume Next 'key (label) may not be present for current html document _
i.e. url so ignore errors when attempting to update blank dictionary via dt > strong matching on key. If label not found then value = vbNullString
Dim key As String, value As String
key = Trim$(listOfLabelsAndValues.Item(currentIndex).innerText)
value = Trim$(listOfLabelsAndValues.Item(currentIndex + 1).innerText) 'as we are looping every 2 indices 0,2,4 ....
If updatedDictionary.Exists(key) Then updatedDictionary(key) = value
'On Error GoTo 0
Next
Set GetCurrentDetailedInfo = updatedDictionary ' return updated dictionary
End Function
Public Function GetBlankDetailedInformationDictionary() As Scripting.Dictionary
Dim blankDictionary As Scripting.Dictionary, keys() As Variant, key As Long
Set blankDictionary = New Scripting.Dictionary
'' TODO Note: you would add in all 18 labels into array below.
keys = Array("Location:", "Type:", "Inventory:", "Real Estate:", "Building SF:", _
"Building Status:", "Lease Expiration:", "Employees:", "Furniture, Fixtures, & Equipment (FF&E):", _
"Facilities:", "Competition:", "Growth & Expansion:", "Financing:", "Support & Training:", _
"Reason for Selling:", "Franchise:", "Home-Based:", "Business Website:")
For key = LBound(keys) To UBound(keys)
blankDictionary(keys(key)) = vbNullString 'add blank entry to dictionary for each label
Next
Set GetBlankDetailedInformationDictionary = blankDictionary
End Function

Loop through NextSibling DIV tagname Selenium VBA

In my HTML page, at this link https://pastebin.com/nu0dLvch
There are elements which have the id "DetailSection1", and I am trying to loop through the div tags that come after each of them (there are five such tags).
This is my attempt, but it didn't work for me:
Dim v, post As Object, sibling As Object, i As Long
Set post = .FindElementsByCss("#DetailSection1")
For i = 1 To post.Count
'Debug.Print post.Item(i).Attribute("outerHTML")
Set sibling = post.Item(i).NextSibling
Select Case sibling.NodeType
Case 3
v = sibling.NodeValue
Case 1
v = sibling.innerText
End Select
Debug.Print v
Next i
How can I loop through the five tags of DIV after this element of id "#DetailSection1"?
This is what I got as Arabic characters (Green is what I got while the yellow part is the correct characters)
1) If you want to use your logic then you will need to loop the nextElementSiblings until empty string is returned (rather than hardcode loop to 5). n.b. I am using nextElementSibling as I want the next sibling element node (node type 1).
W3C Element Traversal Specification
2.4. nextElementSibling:
Accessing this attribute of an element must return a reference to the sibling node of that element which most
immediately follows that element in document order, and which is of
nodeType 1, as an Element object. If the element on which this
attribute is accessed does not have any following sibling nodes, or if
none of those following sibling nodes are element nodes, then this
attribute must return null.
This is an IE version, as I can test that, but it should be easy to translate. Selenium Basic does not expose a nextElementSibling method, so you can either use HTMLDocument and querySelector on the page html, transferred from the webdriver's pageSource (assuming MSHTML.HTMLDocument makes no non-trivial changes to the html), or switch to nextSibling traversal and add a nodeType = 1 test during the loop.
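The nextSibling alternative mentioned above might look something like the sketch below. FollowingElementTexts is a hypothetical helper name. Note one deliberate difference from the IE version that follows: it walks until no sibling remains, skipping empty elements, rather than stopping at the first empty one, and it checks for Nothing before touching the node.

```vba
Option Explicit

' Hypothetical helper: collects the trimmed text of every element node
' (nodeType = 1) that follows startNode at the same level of the document.
Public Function FollowingElementTexts(ByVal startNode As Object) As Collection
    Dim texts As New Collection, node As Object, currentText As String

    Set node = startNode.nextSibling
    Do While Not node Is Nothing                   ' Nothing once siblings run out
        If node.nodeType = 1 Then                  ' element nodes only; skips text/comment nodes
            currentText = Trim$(node.innerText)
            If Len(currentText) > 0 Then texts.Add currentText
        End If
        Set node = node.nextSibling
    Loop

    Set FollowingElementTexts = texts
End Function
```

You would call it once per start node, e.g. with startNodes.item(i), and write the returned items across a row.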
Option Explicit
Public Sub TEST()
Dim ie As New InternetExplorer
ie.Visible = True
'ie.Document.Charset = "utf-8" ''< may be require for handling Arabic chars. Not required with my settings.
ie.Navigate2 "file:///C:/Users/<User>/Desktop/Test.html"
While ie.Busy Or ie.readyState <> READYSTATE_COMPLETE: DoEvents: Wend
Dim startNodes As Object, node As Object, firstNode As Boolean
Dim i As Long, c As Long, r As Long
Set startNodes = ie.document.querySelectorAll("#DetailSection1")
r = 1
For i = 0 To startNodes.Length - 1 'you could determine number of child divs to get num columns then use step 5 loop or mod to write out in rows and cols
Set node = Nothing
firstNode = True
c = 1
Do
If firstNode Then
Set node = startNodes.item(i).nextElementSibling
Else
Set node = node.nextElementSibling
End If
If node.innerText <> vbNullString Then
c = c + 1 'you may need backwards loop to reverse output columns
ActiveSheet.Cells(r, c) = Trim$(node.innerText)
End If
firstNode = False
Loop Until node.innerText = vbNullString
r = r + 1
Next
ie.Quit
End Sub
2) You could dynamically pick up class and use a combination of css attribute selectors to hopefully correctly target nodes of interest. I have read in your html from file via IE. It is shaky due to reliance on attributes and relationships. It is all nested tables with few distinguishing features and likely dynamic attribute values. If not dynamic, then hardcode value for targetClass as ad66b5fc2d-4b59-45e6-b104-e14dfb5b1dac-0.
Option Explicit
Public Sub Test()
Dim ie As New InternetExplorer
ie.Visible = True
ie.Navigate2 "file:///C:/Users/User/Desktop/Test.html"
While ie.Busy Or ie.readyState <> READYSTATE_COMPLETE: DoEvents: Wend
Dim startNodes As Object, i As Long, targetClass As String, cssSelector As String
targetClass = "." & ie.document.querySelector("#DetailSection1").nextElementSibling.className ' "." & ie.document.querySelector("div[style*='center']").className
cssSelector = targetClass & "[style*='center'], " & targetClass & "[style*='center'] ~ div[style*='text-align']"
Set startNodes = ie.document.querySelectorAll(cssSelector)
For i = 0 To startNodes.Length - 2 'you could determine number of child divs to get num columns then use step 5 loop or mod to write out in rows and cols
ActiveSheet.Cells(i + 1, 1) = startNodes.item(i).innerText
Next
ie.Quit
End Sub
Untested Selenium translation (I can't test it, so it is written from memory (eek!)):
Dim startNodes As Object, i As Long, targetClass As String, cssSelector As String
targetClass = "." & .FindElementByCss("div[style*='center']").Attribute("class")
cssSelector = targetClass & "[style*='center'], " & targetClass & "[style*='center'] ~ div[style*='text-align']"
Set startNodes = .FindElementsByCss(cssSelector)
For i = 0 To startNodes.Count - 2 'you could determine number of child divs to get num columns then use step 5 loop or mod to write out in rows and cols
ActiveSheet.Cells(i + 1, 1) = startNodes.item(i).Text
Next
You can explore whether you can get targetClass from (nodeType = 1):
.FindElementByCss("#DetailSection1").nextSibling.className
.FindElementByCss("#DetailSection1").nextSibling.Attribute("class") '<== I think this
I'm sorry I can't test. I'm not sure those versions will work but would be good to know.
Ref:
https://stackoverflow.com/a/62366101/6241235 @أبو عائشة ورقية ومحمد
In fact, I am very satisfied with the solutions presented by QHarr. But I am eager to discover and learn new skills, so, after studying what QHarr presented, I tried on my side, and this is my attempt in Selenium:
Sub Test()
Dim bot As New ChromeDriver, a(1 To 1000, 1 To 5), post As Object, i As Long, j As Long
With bot
.AddArgument "--headless"
.Get "file:///C:\Sample.html"
Set post = .FindElementsByCss("#DetailSection1")
If post.Count > 0 Then
For i = 1 To post.Count
For j = 1 To 5
a(i, j) = Application.WorksheetFunction.Clean(Trim$(post.Item(i).FindElementsByXPath("following-sibling::div")(j).Text))
Next j
Next i
ActiveSheet.Range("A1").Resize(post.Count, UBound(a, 2)).Value = a
End If
End With
End Sub
Note: I welcome any other ideas, as I am eager to learn about different approaches.

Web Scraping - VBA

I am trying to scrape data from a website without any luck. I managed to navigate through the elements, but I haven't managed to get the information from the last elements. Below is my code; any help would be appreciated.
Option Explicit
Sub Download_Historical_Data()
Dim IE As InternetExplorer, doc As HTMLDocument
Dim All_Matches, Match
Dim All_Champions, Champion
'Open Browser and download data
Set IE = New InternetExplorer
With IE
.Visible = True
.Navigate ("https://www.scorespro.com/soccer/results/")
While .Busy Or .readyState < 4: DoEvents: Wend
Set doc = .document
End With
Set All_Champions = doc.getElementById("matches-data").getElementsByClassName("compgrp")
For Each Champion In All_Champions
Set All_Matches = Champion.getElementsByTagName("table")
For Each Match In All_Matches
If Left(Match.className, 12) = "blocks gteam" Then
With Match
'All the info
End With
End If
Next Match
Next Champion
IE.Quit
Set IE = Nothing
End Sub
Sample on 9/8/19:
Sample on 7/8/19:
Output:
The reason I have used two different days as samples is that there is a game with penalties, and I want to include this as well.
You don't need to automate a browser. If you inspect the network traffic when selecting a date you will see an XHR request for the info. You can use those details (in fact I shorten to just the required url params) to retrieve the page content.
The info is contained in table tag elements. The champion is in tables with class name blockBar, otherwise the info is for the row info as seen on page. In order to leverage querySelector (which is a method of HTMLDocument) to select the sub table level elements, by class name, for individual tables, I stick the individual table html into a surrogate html document variable; I then have access to querySelector again and so can write nice flexible/descriptive css selectors to match on elements.
The columns in your output all have nice descriptive class names in the XHR response, so you can use those to determine which column to write to. As the score info risks losing its formatting on output, I use a Select Case statement to test for those CSS selectors and prepend a single quote to preserve the formatting.
I choose, for efficiency, to store all results in an array and write out in one go.
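The surrogate-document step and the single-quote prefix can be sketched in isolation like this. FirstMatchText is a hypothetical helper name, and unlike the code below it uses an explicit Is Nothing test instead of On Error Resume Next to cope with a class that is missing from a given table.

```vba
Option Explicit
' Requires VBE > Tools > References > Microsoft HTML Object Library

' Hypothetical helper: loads one table's html into a surrogate document so
' querySelector can be used on it, then returns the first match's text.
' keepAsText:=True prefixes a single quote so Excel stores the value as text.
Public Function FirstMatchText(ByVal fragmentHtml As String, ByVal css As String, _
                               Optional ByVal keepAsText As Boolean = False) As String
    Dim surrogate As MSHTML.HTMLDocument, node As Object

    Set surrogate = New MSHTML.HTMLDocument
    surrogate.body.innerHTML = fragmentHtml

    Set node = surrogate.querySelector(css)
    If node Is Nothing Then Exit Function          ' selector absent: return empty string

    If keepAsText Then
        FirstMatchText = "'" & node.innerText
    Else
        FirstMatchText = node.innerText
    End If
End Function
```

For the score columns you might call FirstMatchText(tables.item(i).outerHTML, ".score_link", True), and leave keepAsText off for team names.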
Option Explicit
Public Sub GetMatchInfo()
Dim headers(), results(), r As Long, c As Long, ws As Worksheet, i As Long
Dim champion As String, html As HTMLDocument, html2 As HTMLDocument, cssSelectors(), j As Long
Set html = New HTMLDocument
Set html2 = New HTMLDocument
Set ws = ThisWorkbook.Worksheets("Sheet1")
headers = Array("Date", "Time", "Status", "Champion", "Home Team", "Full Time Score", "Away Team", "Half Time", "Penalties Score")
cssSelectors = Array(".kick_t_dt", ".kick_t_ko", ".status", "champion", ".home", ".score_link", ".away", ".halftime", ".after_pen")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.scorespro.com/soccer/ajax-calendar.php?mode=results&date=2019-08-07", False
.send
html.body.innerHTML = .responseText
End With
Dim tables As Object, selector As String
Set tables = html.querySelectorAll("table")
ReDim results(1 To tables.Length, 1 To UBound(headers) + 1)
For i = 0 To tables.Length - 1
If tables.item(i).className = "blockBar" Then
champion = tables.item(i).innerText
Else
r = r + 1
html2.body.innerHTML = tables.item(i).outerHTML
On Error Resume Next
For j = LBound(cssSelectors) To UBound(cssSelectors)
selector = cssSelectors(j)
Select Case selector
Case ".score_link", ".halftime", ".after_pen"
results(r, j + 1) = "'" & html2.querySelector(cssSelectors(j)).innerText
Case "champion"
results(r, j + 1) = champion
Case Else
results(r, j + 1) = html2.querySelector(cssSelectors(j)).innerText
End Select
Next
On Error GoTo 0
End If
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
Example sample output:
Using IE
Option Explicit
Public Sub GetMatchInfo()
Dim headers(), results(), r As Long, c As Long, ws As Worksheet, i As Long
Dim champion As String, html As HTMLDocument, html2 As HTMLDocument, cssSelectors(), j As Long
Set html = New HTMLDocument
Set html2 = New HTMLDocument
Set ws = ThisWorkbook.Worksheets("Sheet1")
headers = Array("Date", "Time", "Status", "Champion", "Home Team", "Full Time Score", "Away Team", "Half Time", "Penalties Score")
cssSelectors = Array(".kick_t_dt", ".kick_t_ko", ".status", "champion", ".home", ".score_link", ".away", ".halftime", ".after_pen")
With CreateObject("InternetExplorer.Application")
.Navigate2 "https://www.scorespro.com/soccer/results/"
While .Busy Or .readyState <> 4: DoEvents: Wend
Application.Wait Now + TimeSerial(0, 0, 2)
html.body.innerHTML = .document.body.innerHTML
.Quit
End With
Dim tables As Object, selector As String
Set tables = html.querySelectorAll("table")
ReDim results(1 To tables.Length, 1 To UBound(headers) + 1)
For i = 0 To tables.Length - 1
If tables.item(i).className = "blockBar" Then
champion = tables.item(i).innerText
Else
r = r + 1
html2.body.innerHTML = tables.item(i).outerHTML
On Error Resume Next
For j = LBound(cssSelectors) To UBound(cssSelectors)
selector = cssSelectors(j)
Select Case selector
Case ".score_link", ".halftime", ".after_pen"
results(r, j + 1) = "'" & html2.querySelector(cssSelectors(j)).innerText
Case "champion"
results(r, j + 1) = champion
Case Else
results(r, j + 1) = html2.querySelector(cssSelectors(j)).innerText
End Select
Next
On Error GoTo 0
End If
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
I wrote this in WSL (web scraping language), but basically you can edit the JSON to add any other fields (assuming all the football games). Once you have all the data, you can have it sent either to your email or to your web server.
GOTO www.scorespro.com/soccer/results/ >>
EXTRACT {'time': '.kick_t', 'status':'.status',
'home':'.home.uc', 'score':'.score', 'away':'.away', 'match':'a'} IN table tr
Explanation: it goes to that score page, and then pulls time, status, home, score, away fields for each table row via table tr and finally the match field which will come from the header bar table row. It will look like {'time':undefined, ...., 'match':'Armenia: Premier League'} along with other table row game schedules like {'time':'2019/8/21' ,..., 'match':undefined}. Just merge the JSON objects afterwards.

Web Scraping: Button clicking and help navigating through paths

I am trying to scrape some doctor names and addresses from the website: https://albertafindadoctor.ca/find-a-doc/directory
I am trying to solve the following issue:
Once on the doctor's toggle, I want to pull 4 pieces of data from the entire page, not just the first 25 displayed.
While the code works for the initial webpage, it only pulls the first 25 pieces of data. There are a significant number of other pages that I still need to pull (3822 different doctors).
Unfortunately, I'm at a loss on how to navigate and pull from these different pages. When I inspect elements to see how to navigate between pages, I see a matrix changing, so I'm not sure if that has something to do with it?
Option Explicit
Sub GetAlbertaDoctors()
Dim objIE As InternetExplorer
Dim clinicEle As Object
Dim clinicName As String
Dim clinicAddress As String
Dim clinicCategory As String
Dim doctorName As String
Dim y As Integer
Set objIE = New InternetExplorer
objIE.Visible = True
objIE.Navigate "https://albertafindadoctor.ca/find-a-doc/directory"
While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Wend
Application.Wait (Now + TimeValue("0:00:5"))
objIE.Document.getElementsByClassName("physician-toggle")(0).Click
Application.Wait (Now + TimeValue("0:00:5"))
y = 2
For Each clinicEle In objIE.Document.getElementsByClassName("clinic")
clinicCategory = clinicEle.getElementsByClassName("pcn")(0).innerText
clinicName = clinicEle.getElementsByClassName("clinic-name")(0).innerText
doctorName = clinicEle.getElementsByTagName("h3")(0).innerText
clinicAddress = clinicEle.getElementsByClassName("address")(0).innerText
Sheets("Sheet2").Range("A" & y).Value = clinicCategory
Sheets("Sheet2").Range("B" & y).Value = clinicName
Sheets("Sheet2").Range("C" & y).Value = doctorName
Sheets("Sheet2").Range("D" & y).Value = clinicAddress
y = y + 1
Next
objIE.Quit
End Sub
When I run this, I get the error 91 "Object variable or With block variable not set" on the clicking line:
objIE.Document.getElementsByClassName("physician-toggle active")(0).Click
You don't need to loop all pages. You can use the browser to get to that page and click on Doctors if required. After that, grab the number of results and then mimic the xhr request the page makes for listings - which is returned as json. Alter the query string the page makes i.e. the parameter for limit to get all listings. Use a json parser (I use jsonconverter - instructions in the code for installation) to parse out your info.
There is a proper page load wait and a couple of loops to ensure elements are present. These should really be timed loops, so that the code cannot hang forever if an element never appears.
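As a rough idea of what a timed loop could look like, here is a minimal sketch. The names WaitForElement and ElapsedSeconds are made up for illustration and the 10 second default is an arbitrary assumption; it polls for a css selector and gives up once the time limit has passed.

```vba
Option Explicit

' Hypothetical helper (not part of the code in this answer): polls doc for a
' css selector until it matches or maxSeconds have elapsed.
' Returns Nothing on timeout. The 10 second default is an assumption.
Public Function WaitForElement(ByVal doc As Object, ByVal selector As String, _
                               Optional ByVal maxSeconds As Double = 10) As Object
    Dim started As Double, ele As Object
    started = Timer
    Do
        DoEvents                                  ' keep Excel responsive while polling
        On Error Resume Next                      ' the document may not be ready yet
        Set ele = doc.querySelector(selector)
        On Error GoTo 0
        If Not ele Is Nothing Then Exit Do
    Loop While ElapsedSeconds(started) < maxSeconds
    Set WaitForElement = ele
End Function

' Seconds since started, allowing for Timer resetting to 0 at midnight.
Private Function ElapsedSeconds(ByVal started As Double) As Double
    Dim t As Double
    t = Timer - started
    If t < 0 Then t = t + 86400
    ElapsedSeconds = t
End Function
```

Something like Set ele = WaitForElement(.document, ".paginator") followed by an If ele Is Nothing Then check would then let you exit cleanly on a timeout instead of looping indefinitely.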
I add an additional test to ensure you do not attempt to click Doctors when it is not required to do so.
Not all listings have all of the info, hence the On Error Resume Next paired with On Error GoTo 0. It looks like you may be able to build a dictionary to fill in some of the blank values based on existing paired values (or using ids present in the json object).
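If you would rather not suppress errors, one possible alternative is to test for each key before reading it. This is only a sketch: ValueOrDefault is a made-up name, and it assumes the object passed in is a Scripting.Dictionary, which is what jsonconverter returns for json objects (json arrays come back as Collections and would need separate handling).

```vba
Option Explicit

' Hypothetical helper: returns dict(key) when the key exists and holds a
' simple value; otherwise returns fallback. Assumes dict is a
' Scripting.Dictionary (or Nothing).
Public Function ValueOrDefault(ByVal dict As Object, ByVal key As String, _
                               Optional ByVal fallback As Variant = "") As Variant
    ValueOrDefault = fallback
    If dict Is Nothing Then Exit Function
    If Not dict.Exists(key) Then Exit Function
    If IsObject(dict(key)) Then Exit Function    ' nested object or array
    If IsNull(dict(key)) Then Exit Function      ' json null
    ValueOrDefault = dict(key)
End Function
```

With this, a line such as results(r, 3) = ValueOrDefault(row, "clinical_name") would leave an empty string where the listing has no value.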
I store all results in an array and write out in one go.
Option Explicit
'VBE > Tools > References:
' Microsoft Internet Controls
' Microsoft Scripting Runtime
'Download and add to standard module called jsonconverter from https://github.com/VBA-tools/VBA-JSON/blob/master/JsonConverter.bas
Public Sub GetListings()
Dim ie As InternetExplorer, s As String, json As Object, newUrl As String
Set ie = New InternetExplorer
With ie
.Visible = True
.Navigate2 "https://albertafindadoctor.ca/find-a-doc/directory"
While .Busy Or .readyState < 4: DoEvents: Wend
With .document.querySelector("[data-cp-option=physician]")
If Not .className = "physician-toggle active" Then .Click
End With
Dim resultsInfo() As String, numResults As Long, ele As Object
Do
On Error Resume Next
Set ele = .document.querySelector(".paginator")
On Error GoTo 0
Loop While ele Is Nothing
Do
Loop While .document.querySelector(".paginator").innerText = vbNullString
resultsInfo = Split(Trim$(.document.querySelector(".paginator").innerText), "of ")
.Quit
End With
numResults = resultsInfo(UBound(resultsInfo))
newUrl = "https://albertafindadoctor.ca/search/directory/physicians?page=1&limit=" & numResults & "&with[]=pcn&with[]=clinics&with[]=languages&with[]=specialties"
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", newUrl, False
.send
Set json = JsonConverter.ParseJson(.responseText)("items")
End With
Dim row As Object, results(), r As Long, headers(), ws As Worksheet, key As Variant
headers = Array("clinicCategory", "clinicName", "doctorName", "clinicAddress")
Set ws = ThisWorkbook.Worksheets("Sheet1")
ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
For Each row In json
r = r + 1
On Error Resume Next
For Each key In row.keys
Select Case key
Case "clinical_name"
results(r, 3) = row(key)
Case "pcn"
results(r, 1) = row(key)("name")
Case "clinics"
results(r, 2) = row(key)(1)("name")
results(r, 4) = Join$(Array(row(key)(1)("street_address"), row(key)(1)("city"), row(key)(1)("province"), row(key)(1)("postal_code")), ", ")
End Select
Next
On Error GoTo 0
Next
With ws
.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Reading:
querySelector
json
css selectors
arrays and arrays2

eBay Product scraper

My VBA knowledge is very limited.
The code is in a module and has a sub process as well, so sorry if I have posted the code wrong.
A) open IE
B) Subprocess gets the data.
The code works fine on ebay.com but NOT on ebay.co.uk, and I can't work out why. It also converts the URLs to hyperlinks.
It only does the first page; I need it to go through X number of pages. I have code for that but can't get it to work, so I have removed it.
Can the search query be run AFTER eBay opens, so that it opens, the search item is entered into eBay, and then the code runs? Or can it run from a cell? If the search term is in cell A1, the extracted data needs to be pasted into A2 and below.
I have looked at the elements for ebay.com and ebay.co.uk and they look the same to me, so I can't work out why it works for one and not the other.
I did add the code for getting data from several pages, but it did not work. I know that code works, as I use it when I fetch URLs from Google.
Public IE As New SHDocVw.InternetExplorer
Sub GetData()
Dim HTMLdoc As MSHTml.HTMLDocument
Dim othwb As Variant
Dim objShellWindows As New SHDocVw.ShellWindows
Set IE = CreateObject("internetexplorer.application")
With IE
.Visible = True
'.Navigate "https://www.ebay.co.uk/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=jackets&_sacat=0"
.Navigate "https://www.ebay.com/sch/i.html?_from=R40&_nkw=ralph+lauren&_sacat=1059&LH_TitleDesc=0&_dmd=1&rt=nc"
While .Busy Or .readyState <> 4: DoEvents: Wend
Set HTMLdoc = IE.document
ProcessHTMLPage HTMLdoc
.Quit
End With
End Sub
'''''' THIS IS THE SUB PROCESS '''''
Sub ProcessHTMLPage(HTMLPage As MSHTml.HTMLDocument)
Dim HTMLItem As MSHTml.IHTMLElement
Dim HTMLItems As MSHTml.IHTMLElementCollection
Dim HTMLInput As MSHTml.IHTMLElement
Dim rownum As Long
rownum = 1
Set HTMLItems = HTMLPage.getElementsByClassName("s-item__title")
For Each HTMLItem In HTMLItems
Cells(rownum, 1).Value = HTMLItem.innerText
rownum = rownum + 1
Next HTMLItem
rownum = 1
Set HTMLItems = HTMLPage.getElementsByClassName("s-item__price")
For Each HTMLItem In HTMLItems
Cells(rownum, 2).Value = HTMLItem.innerText
rownum = rownum + 1
Next HTMLItem
rownum = 1
Set HTMLItems = HTMLPage.getElementsByClassName("s-item__link")
For Each HTMLItem In HTMLItems
Cells(rownum, 3).Value = HTMLItem.href
rownum = rownum + 1
Next HTMLItem
'Converts each text hyperlink selected into a working hyperlink from C1 to 25000 rows
Range("C1:C25000").Select
For Each xCell In Selection
ActiveSheet.Hyperlinks.Add Anchor:=xCell, Address:=xCell.Formula
Next xCell
Range("C1").Select
End Sub
Code for going to next page
pageNumber = 1
'i = 2
If pageNumber >= 6 Then Exit Do 'the first 6 pages
internetdata.getElementById("pnnext").click 'next web page
Do While internet.Busy Or internet.readyState <> 4
DoEvents
Loop
Set internetdata = internet.document
pageNumber = pageNumber + 1
Loop
Does not work on Ebay.co.uk - NO RESULTS ARE EXTRACTED - Works fine in ebay.com
Need it to get data from X amount of pages and NOT just 1 page
This is my code for Google search. I have got it working so that the search comes from cell A1, and I am looking for something like this. I am going to see if I can use the eBay code with this, as it also does the first 25 pages of a Google search.
Sub webpage()
Dim ie As Object
Dim htmlDoc As Object
Dim nextPageElement As Object
Dim div As Object
Dim link As Object
Dim url As String
Dim pageNumber As Long
Dim i As Long
' Takes search from A1 and places it into google
url = "https://www.google.co.uk/search?q=" & Replace(Worksheets("Sheet1").Range("A1").Value, " ", "+")
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
.navigate url
Do While .Busy Or .readyState <> 4
DoEvents
Loop
End With
Application.Wait Now + TimeSerial(0, 0, 5)
Set htmlDoc = ie.document
pageNumber = 1
i = 2
Do
For Each div In htmlDoc.getElementsByTagName("div")
If div.getAttribute("class") = "r" Then
Set link = div.getElementsByTagName("a")(0)
Cells(i, 2).Value = link.getAttribute("href")
i = i + 1
End If
Next div
If pageNumber >= 25 Then Exit Do 'the first 25 pages
Set nextPageElement = htmlDoc.getElementById("pnnext")
If nextPageElement Is Nothing Then Exit Do
' Clicks web next page
nextPageElement.Click 'next web page
Do While ie.Busy Or ie.readyState <> 4
DoEvents
Loop
Application.Wait Now + TimeSerial(0, 0, 5)
Set htmlDoc = ie.document
pageNumber = pageNumber + 1
Loop
MsgBox "All Done"
Set ie = Nothing
Set htmlDoc = Nothing
Set nextPageElement = Nothing
Set div = Nothing
Set link = Nothing
End Sub
Question 1: Why does it work for one domain but not the other?
To answer question 1 (the other questions should be new posts): the HTML is not the same at all. The classes which work for ebay.com are not found on ebay.co.uk, so your loops over the collections don't do anything because the collections have a count of 0 (or a length of 0 for nodeLists, if using querySelectorAll). Instead, you need branched code: set your selectors based on the URL domain.
I have used css selectors as this is the easiest, and fastest way, to select the required elements whilst maintaining the flexibility of a code re-factor to reduce the lines of repeated code.
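If you want to keep the branching separate from the scraping, it could be pulled into a small function along these lines. This is just a sketch: SelectorsForUrl is a made-up name, the selector strings are the ones used in this answer (they depend on eBay's markup and may stop matching if it changes), and returning an empty array for an unknown domain is my own assumption so that a loop over the result simply does nothing.

```vba
Option Explicit

' Hypothetical helper: returns the css selectors (title, price, link)
' to use for a given page URL, or an empty array for an unknown domain.
Public Function SelectorsForUrl(ByVal pageUrl As String) As Variant
    Select Case True
        Case InStr(1, pageUrl, "ebay.co.uk", vbTextCompare) > 0
            SelectorsForUrl = Array(".gvtitle a", ".amt", ".gvtitle a")
        Case InStr(1, pageUrl, "ebay.com", vbTextCompare) > 0
            SelectorsForUrl = Array(".s-item__title", ".s-item__price", ".s-item__link")
        Case Else
            SelectorsForUrl = Array()   ' UBound is -1, so For loops are skipped
    End Select
End Function
```

You would call it with the document URL, e.g. cssSelectors = SelectorsForUrl(.document.URL), and loop from LBound to UBound as usual.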
Side note:
If you are unsure about whether your selection method will work across different pages you can do at least two things:
Right click > inspect element > visually check the class names are the same for the elements you are attempting to compare. So, if you are looking at product names, are the class names in the html the same on both pages?
You can use the search facility of the browser: open the Elements tab via F12, then press Ctrl+F to pull up the search box, enter your class name from the first page into this box on the second page and hit Enter. You can also enter css selectors here and, in some cases, regex. You will get a hit count telling you how many matches were found. You can keep pressing Enter to cycle through the matches, and each match will be highlighted in the html above, so you can easily compare whether the matched results are what you expected.
img url: https://i.stack.imgur.com/MWkEx.png
VBA:
Option Explicit
Public Sub GetData()
Dim htmlDoc As MSHTML.HTMLDocument, ie As SHDocVw.InternetExplorer, ws As Worksheet
Set ie = New SHDocVw.InternetExplorer
Set htmlDoc = New MSHTML.HTMLDocument
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
.Visible = True
'.Navigate2 "https://www.ebay.co.uk/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=jackets&_sacat=0"
.Navigate2 "https://www.ebay.com/sch/i.html?_from=R40&_nkw=ralph+lauren&_sacat=1059&LH_TitleDesc=0&_dmd=1&rt=nc"
While .Busy Or .readyState <> 4: DoEvents: Wend
Dim index As Long, HTMLItems As Object, rowNum As Long, xCell As Range
Dim cssSelectors(), i As Long
Select Case True
Case InStr(.document.URL, "ebay.co.uk") > 0
cssSelectors = Array(".gvtitle a", ".amt", ".gvtitle a")
Case InStr(.document.URL, "ebay.com") > 0
cssSelectors = Array(".s-item__title", ".s-item__price", ".s-item__link")
End Select
With ws
For i = LBound(cssSelectors) To UBound(cssSelectors)
rowNum = 1
Set HTMLItems = ie.document.querySelectorAll(cssSelectors(i))
For index = 0 To HTMLItems.length - 1
.Cells(rowNum, i + 1).Value = IIf(i = 2, HTMLItems.item(index).getAttribute("href"), HTMLItems.item(index).innerText)
rowNum = rowNum + 1
Next
Next
For Each xCell In .Range("C1:C25000") '<= all these really?
.Hyperlinks.Add Anchor:=xCell, Address:=xCell.Formula
Next xCell
End With
.Quit
End With
End Sub
If this works on eBay then you need to find out yourself why it's not working on ebay.co.uk. My point is that if the code itself works then there is nothing we can help you with here. You need to take some time to investigate ebay.co.uk and find the differences, as I am sure it's something minor. I can't help you fix code that isn't actually broken. I wish you luck though.

Resources