VBA - web scraping can not find correct GET request - excel

My question is related to other question VBA - web scraping can not get HTMLElement innerText. I have a similar problem
Website URL - https://www.hnb.hr/en/core-functions/monetary-policy/exchange-rate-list/exchange-rate-list
I need to get the date of currency reference and the selected values. The problem is that I can not find a correct GET request where these values are finally generated. I've found that it is related to the POST request:
POST /en/core-functions/monetary-policy/exchange-rate-list/exchange-rate-list?p_p_id=tecajnalistacontroller_WAR_hnbtecajnalistaportlet&p_p_lifecycle=2&p_p_state=normal&p_p_mode=view&p_p_resource_id=getTecajnaAjaxDataURL&p_p_cacheability=cacheLevelPage&p_p_col_id=column-2&p_p_col_count=2 HTTP/1.1
I would like to use a technique with getting by id, class or tag - whatever but again, provided GET URL request is too quick to retrieve the required info

XMLHTTP request and API:
I would use their API as shown below. I have some helper functions to aid with parsing the response. In GetDict function you can set the currencies you are interested in. In function GetRate you can specify the rate you are interested in. If you don't specify, it defaults to "median_rate".
Calling the API:
To get the rates for a particular date, make a[n] HTTP call to the
following URL:
http://hnbex.eu/api/v1/rates/daily/?date=YYYY-MM-DD
The date parameter is optional. If not set, the current date (today)
is used.
You can parse the JSON response with a JSON parser but I found it simpler to go with using Split to grab the required info from the JSON string. If you are familiar with JSON I will happily update with a JSON parsing example.
Option Explicit
Public Sub GetInfo()
'http://hnbex.eu/api/v1/
Dim strJSON As String, http As Object, json As Object
Const URL As String = "http://hnbex.eu/api/v1/rates/daily/"
Set http = CreateObject("MSXML2.XMLHTTP")
With http
.Open "GET", URL, False
.send
strJSON = .responseText
End With
'Set json = JsonConverter.ParseJson(strJSON) '<== You could parse the JSON using a JSON parse such as [JSONConverter][1]
Dim currencyDict As Object
Set currencyDict = GetDict
Dim key As Variant, dictKeys As Variant, result As Variant
For Each key In currencyDict.keys
result = GetRate(strJSON, key)
If Not IsError(result) Then currencyDict(key) = result
result = vbNullString
Next key
PrintDictionary currencyDict
End Sub
Public Function GetDict() As Object '<== You could adapt to pass currencies as string arguments to the function. Or even a string array.
Dim dict As Object
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "EUR", vbNullString
dict.Add "CZK", vbNullString
dict.Add "HRK", vbNullString
dict.Add "HUF", vbNullString
dict.Add "PLN", vbNullString
dict.Add "RON", vbNullString
dict.Add "RSD", vbNullString
Set GetDict = dict
End Function
Public Function GetRate(ByVal json As String, ByVal key As Variant, Optional ByVal rate As String = "median_rate") As Variant
Dim arr() As String, tempString As String
On Error GoTo Errhand
arr = Split(json, """currency_code"": " & Chr$(34) & key & Chr$(34))
tempString = arr(1)
tempString = Split(arr(1), Chr$(34) & rate & Chr$(34) & ":")(1)
tempString = Split(tempString, ",")(0)
GetRate = tempString
Exit Function
Errhand:
GetRate = CVErr(xlErrNA)
End Function
Public Sub PrintDictionary(ByVal dict As Object)
Dim key As Variant
For Each key In dict.keys
Debug.Print key & " : " & dict(key)
Next
End Sub
Internet Explorer:
You can use an loop with explicit wait for element to be present on page (or populated)
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, t As Date, hTable As HTMLTable, clipboard As Object
Const WAIT_TIME_SECS As Long = 5
t = Timer
With IE
.Visible = True
.navigate "https://www.hnb.hr/en/core-functions/monetary-policy/exchange-rate-list/exchange-rate-list"
While .Busy Or .readyState < 4: DoEvents: Wend
Do
DoEvents
On Error Resume Next
Set hTable = .document.getElementById("records_table")
On Error GoTo 0
If Timer - t > WAIT_TIME_SECS Then Exit Do
Loop While hTable Is Nothing
If hTable Is Nothing Then
.Quit
Exit Sub
End If
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).PasteSpecial
.Quit '<== Remember to quit application
End With
End Sub

Related

Scrape economic data from investing website

I am working on a code to get data from : https://www.investing.com/economic-calendar/core-durable-goods-orders-59
I have got the code for getting this via httprequest: but looking to change this to work for the economic data (link above) is there any way I can get the same for the economic indicators??
code below:
Option Explicit
Sub Export_Table()
'Html Objects---------------------------------------'
Dim htmlDoc As MSHTML.HTMLDocument
Dim htmlBody As MSHTML.htmlBody
Dim ieTable As MSHTML.HTMLTable
Dim Element As MSHTML.HTMLElementCollection
'Workbooks, Worksheets, Ranges, LastRow, Incrementers ----------------'
Dim wb As Workbook
Dim Table As Worksheet
Dim i As Long
Set wb = ThisWorkbook
Set Table = wb.Worksheets("Sheet1")
'-------------------------------------------'
Dim xmlHttpRequest As New MSXML2.XMLHTTP60 '
'-------------------------------------------'
i = 2
'Web Request --------------------------------------------------------------------------'
With xmlHttpRequest
.Open "POST", "https://www.investing.com/instruments/HistoricalDataAjax", False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
.setRequestHeader "X-Requested-With", "XMLHttpRequest"
.send "curr_id=951681&smlID=1695217&header=CLNX+Historical+Data&st_date=01%2F01%2F2017&end_date=03%2F01%2F2019&interval_sec=Monthly&sort_col=date&sort_ord=DESC&action=historical_data"
If .Status = 200 Then
Set htmlDoc = CreateHTMLDoc
Set htmlBody = htmlDoc.body
htmlBody.innerHTML = xmlHttpRequest.responseText
Set ieTable = htmlDoc.getElementById("curr_table")
For Each Element In ieTable.getElementsByTagName("tr")
Table.Cells(i, 1) = Element.Children(0).innerText
Table.Cells(i, 2) = Element.Children(1).innerText
Table.Cells(i, 3) = Element.Children(2).innerText
Table.Cells(i, 4) = Element.Children(3).innerText
Table.Cells(i, 5) = Element.Children(4).innerText
Table.Cells(i, 6) = Element.Children(5).innerText
Table.Cells(i, 7) = Element.Children(6).innerText
i = i + 1
DoEvents: Next Element
End If
End With
Set xmlHttpRequest = Nothing
Set htmlDoc = Nothing
Set htmlBody = Nothing
Set ieTable = Nothing
Set Element = Nothing
End Sub
Public Function CreateHTMLDoc() As MSHTML.HTMLDocument
Set CreateHTMLDoc = CreateObject("htmlfile")
End Function
I have used the excel tool Power Query for this very thing. It is also called Get & Transform Data. I don't like using vba for doing this type of thing.
To make it work:
In Excel Go to Data>Get Data>From Other Sources>From Web.
Enter the URL
Wait for the webpage to load and then pick the table that you want.
This website took awhile to load, but it did work for me.
Choose "Load" which goes directly to the sheet, or "Transform Data" to manipulate the data in Power Query. There are many options in power query such as split columns, filter data, Calculate Columns and ...
I would avoid the overhead of setting up a permanent connection and simply continue using XHR. With the data > from web, you cannot grab more rows than are present on the initial landing. If however you go with XHR, you can issue POST requests to get more data. The code below utilizes a loop to retrieve additional results beyond the immediate visible on the page.
When you press the Show more link there is a POST request for an additional 6 rows which uses the latest date from the current set of results as part of the POST body. The response returned is JSON. Rather than bringing in a JSON parser, given the standard nature of the JSON, and that I am already using regex to clean the date format in column 1 to put in the POST body, I use two simple regexes to extract the html for the next results table from the response, and to check whether there are more results.
The format of the JSON is:
{
"historyRows": "<tr>…..</tr>",
"hasMoreHistory": "1"
}
Or
{
"historyRows": "<tr>…..</tr>",
"hasMoreHistory": false
}
So, I do some cleaning of the extracted html in order to not confuse the html parser within MSHTML. Furthermore, I add in an id to identify the table I have constructed, so I can continue to use an id css selector (#) list within my UpdateDateResults function.
I initially oversize an array to store each retrieved table which I update ByRef. I loop requesting more results until either there are no more results, there is an error parsing the maximum date from the last retrieved table column 1, or until my specified earliest date for data retrieval falls within the date range of the latest returned table.
Finally, I write the results array out to the sheet in one go.
N.B. You can target the table by its id. It looks like the number at the end of the id could be the same as for the goods url, lending itself to generalizing the code below to work for other goods.
VBA:
Option Explicit
Public Sub GetInvestingInfo()
'tools > references > Microsoft HTML Object Library
Dim html As MSHTML.HTMLDocument, xhr As Object
Const STARTDATE As Date = "2019-11-25" 'Adjust as required. DateAdd("yyyy", -2, Date) 2 years back. This means may have some earlier months in _
batch that spans the start date but won't issue an additional request after this
Set xhr = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", "https://www.investing.com/economic-calendar/core-durable-goods-orders-59", False
.setRequestHeader "User-Agent", "Safari/537.36"
.send
html.body.innerHTML = .responseText
End With
Dim firstTable As Boolean, r As Long, results() As Variant
ReDim results(1 To 100000, 1 To 5)
'process initial table and update results, get cleaned date needed for request for more results
firstTable = True
Dim latestDate As String
UpdateDateResults latestDate, results, firstTable, r, html
Dim re As Object, maxDate As String, hasMoreHistory As Boolean, s As String
Set re = CreateObject("VBScript.RegExp")
With re
.Global = True
.MultiLine = False
End With
maxDate = cleanedDate(latestDate, re)
hasMoreHistory = True
Dim errorDate As Date
errorDate = DateAdd("d", 1, Date)
Do While maxDate >= STARTDATE And maxDate < errorDate 'break loop using pre-defined earliest date, error with date conversion, or when no more rows found
Application.Wait (Now + TimeSerial(0, 0, 1)) 'Pause
s = GetMoreRows(xhr, Format$(maxDate, "YYYY-MM-DD")) 'max a POST request for more data
re.Pattern = "hasMoreHistory"":(""?.*?""?)}" 'Check if there are more rows still available. "1" for yes, false for no
hasMoreHistory = (re.Execute(s)(0).submatches(0) <> False)
If Not hasMoreHistory Then Exit Do
re.Pattern = "historyRows"":""(.*)"","
html.body.innerHTML = "<table id=""me"">" & Replace$(re.Execute(s)(0).submatches(0), "\/", "/") & "</table>" 'fix html and feed into html variable
UpdateDateResults latestDate, results, firstTable, r, html
maxDate = cleanedDate(latestDate, re) 'convert value retrieved from last row in date column of table to an actual date
Loop
With ActiveSheet
.Cells.ClearContents
.Cells(1, 1).Resize(r, 5) = results 'Don't bother to resize results as clear all cells before write ou
End With
End Sub
Public Sub UpdateDateResults(ByRef latestDate As String, ByRef results() As Variant, ByRef firstTable As Boolean, ByRef r As Long, ByVal html As MSHTML.HTMLDocument)
Dim table As MSHTML.HTMLTable 'return latest date from function
Set table = html.querySelector("#eventHistoryTable59, #me")
latestDate = table.Rows(table.Rows.Length - 1).Children(0).innerText
Dim i As Long, n As Long, j As Long
n = IIf(firstTable, 0, 1)
For i = n To table.Rows.Length - 1
r = r + 1
For j = 0 To table.Rows(i).Children.Length - 2
results(r, j + 1) = table.Rows(i).Children(j).innerText
Next
Next
firstTable = False
End Sub
Public Function cleanedDate(ByVal dirtyString As String, ByVal re As Object) As Date
re.Pattern = "(^[A-Z][a-z]{2}).*(\d{2}),.(\d{4})(.*)"
On Error GoTo errhand:
If re.test(dirtyString) Then
cleanedDate = CDate(re.Replace(dirtyString, "$2" & Chr$(32) & "$1" & Chr$(32) & "$3"))
Exit Function
End If
errhand:
cleanedDate = DateAdd("d", 1, Date)
End Function
Public Function GetMoreRows(ByVal xhr As Object, ByVal dateStamp As String) As String
With xhr
.Open "POST", "https://www.investing.com/economic-calendar/more-history", False
.setRequestHeader "User-Agent", "Safari/537.36"
.setRequestHeader "x-requested-with", "XMLHttpRequest"
.setRequestHeader "content-type", "application/x-www-form-urlencoded"
.send "eventID=430865&event_attr_ID=59&event_timestamp=" & dateStamp & "+" & Application.WorksheetFunction.EncodeURL("12:30:00") & "&is_speech=0"
GetMoreRows = .responseText
End With
End Function
Regexes (without the double " escaping for VBA):
hasMoreHistory":("?.*?"?)}
historyRows":"(.*)",

Scraping using VBA

i am trying to extract one figure from a gov website, I have done a lot of googling and I am kinda lost for ideas, my code below returns a figure but it isnt the figure I want to get and I am not entirely sure why.
I want to subtract the figure from the 'Cases by Area (Whole Pandemic)' table 'Upper tier LA' section and 'Southend on Sea' Case number.
https://coronavirus.data.gov.uk/details/cases
I stole this code from online somewhere and tried to replicate with my class number I found within F12 section on the site.
Sub ExtractLastValue()
Set objIE = CreateObject("InternetExplorer.Application")
objIE.Top = 0
objIE.Left = 0
objIE.Width = 800
objIE.Height = 600
objIE.Visible = True
objIE.Navigate ("https://coronavirus.data.gov.uk/details/cases")
Do
DoEvents
Loop Until objIE.readystate = 4
MsgBox objIE.document.getElementsByClassName("sc-bYEvPH khGBIg govuk-table__cell govuk-table__cell--numeric ")(0).innerText
Set objIE = Nothing
End Sub
Data comes from the official API and returns a json response dynamically on that page when you click the Upper Tier panel.
Have a look and play with the API guidance
here:
https://coronavirus.data.gov.uk/details/developers-guide
You can make a direct xhr request by following the guidance in the API documentation and then using a json parser to handle the response. For your request it would be something like the following:
https://coronavirus.data.gov.uk/api/v1/data?filters=areaName=Southend-on-Sea&areaType=utla&latestBy=cumCasesByPublishDate&structure=
{"date":"date", "areaName":"areaName","cumCasesByPublishDate":"cumCasesByPublishDate",
"cumCasesByPublishDateRate":"cumCasesByPublishDateRate"}
XHR:
A worked example using jsonconverter.bas as the json parser
Option Explicit
Public Sub GetCovidNumbers()
Dim http As Object, json As Object
Set http = CreateObject("MSXML2.XMLHTTP")
With http
.Open "GET", "https://coronavirus.data.gov.uk/api/v1/data?filters=areaName=Southend-on-Sea&areaType=utla&latestBy=cumCasesByPublishDate&structure={""date"":""date"",""areaName"":""areaName"",""cumCasesByPublishDate"":""cumCasesByPublishDate"",""cumCasesByPublishDateRate"":""cumCasesByPublishDateRate""}", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
Set json = JsonConverter.ParseJson(.responseText)("data")(1)
End With
With ActiveSheet
Dim arr()
arr = json.Keys
.Cells(1, 1).Resize(1, UBound(arr) + 1) = arr
arr = json.Items
.Cells(2, 1).Resize(1, UBound(arr) + 1) = arr
End With
End Sub
Json library (Used in above solution):
I use jsonconverter.bas. Download raw code from here and add to standard module called JsonConverter . You then need to go VBE > Tools > References > Add reference to Microsoft Scripting Runtime. Remove the top Attribute line from the copied code.
Internet Explorer:
You could do a slower, more complicated, internet explorer solution where you need to select the utla option when present, then select from the table the desired value:
Option Explicit
Public Sub GetCovidNumbers()
'Tools references Microsoft Internet Controls and Microsoft HTML Object Library
Dim ie As SHDocVw.InternetExplorer, t As Date, ele As Object
Const MAX_WAIT_SEC As Long = 10
Set ie = New SHDocVw.InternetExplorer
With ie
.Visible = True
.Navigate2 "https://coronavirus.data.gov.uk/details/cases"
While .Busy Or .ReadyState <> READYSTATE_COMPLETE: DoEvents: Wend
t = Timer 'timed loop for element to be present to click on (to get utla)
Do
On Error Resume Next
Set ele = .Document.querySelector("#card-cases_by_area_whole_pandemic [aria-label='Upper tier LA']")
On Error GoTo 0
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While ele Is Nothing
If ele Is Nothing Then Exit Sub
ele.Click
While .Busy Or .ReadyState <> READYSTATE_COMPLETE: DoEvents: Wend
Dim table As MSHTML.HTMLTable, datetime As String, result()
Set table = .Document.querySelector("table[download='cumCasesByPublishDate,cumCasesByPublishDateRate']")
datetime = .Document.querySelector("time").getAttribute("datetime")
result = GetDataForUtla("Southend-on-Sea", datetime, table)
With ActiveSheet
.Cells(1, 1).Resize(1, 4) = Array("Datetime", "Area", "Cases", "Rate per 100,000 population")
.Cells(2, 1).Resize(1, UBound(result) + 1) = result
End With
.Quit
End With
End Sub
Public Function GetDataForUtla(ByVal utla As String, ByVal datetime As String, ByVal table As MSHTML.HTMLTable) As Variant
Dim row As MSHTML.HTMLTableRow, i As Long
For Each row In table.Rows
If InStr(row.outerHTML, utla) > 0 Then
Dim arr(4)
arr(0) = datetime
For i = 0 To 2
arr(i + 1) = row.Children(i).innerText
Next
GetDataForUtla = arr
Exit Function
End If
Next
GetDataForUtla = Array("Not found")
End Function
References:
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector

Web Scraping: Button clicking and help navigating through paths

I am trying to scrape some doctor names and addresses from the website: https://albertafindadoctor.ca/find-a-doc/directory
I am trying to solve the following issue:
Once on the doctor's toggle, I want to pull 4 pieces of data from the entire page, not just the first 25 displayed.
While the code works for the initial webpage, it only pulls the first 25 pieces of data. There are a significant number of other pages that I still need to pull (3822 different doctors).
Unfortunately, I'm at a loss on how to navigate and pull from these different pages. When I inspect elements to see how to navigate between pages a see matrix changing so I'm not sure if that has something to do with it?
Option Explicit
Sub GetAlbertaDoctors()
Dim objIE As InternetExplorer
Dim clinicEle As Object
Dim clinicName As String
Dim clinicAddress As String
Dim clinicCategory As String
Dim doctorName As String
Dim y As Integer
Set objIE = New InternetExplorer
objIE.Visible = True
objIE.Navigate "https://albertafindadoctor.ca/find-a-doc/directory"
While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Wend
Application.Wait (Now + TimeValue("0:00:5"))
objIE.Document.getElementsByClassName("physician-toggle")(0).Click
Application.Wait (Now + TimeValue("0:00:5"))
y = 2
For Each clinicEle In objIE.Document.getElementsByClassName("clinic")
clinicCategory = clinicEle.getElementsByClassName("pcn")(0).innerText
clinicName = clinicEle.getElementsByClassName("clinic-name")(0).innerText
doctorName = clinicEle.getElementsByTagName("h3")(0).innerText
clinicAddress = clinicEle.getElementsByClassName("address")(0).innerText
Sheets("Sheet2").Range("A" & y).Value = clinicCategory
Sheets("Sheet2").Range("B" & y).Value = clinicName
Sheets("Sheet2").Range("C" & y).Value = doctorName
Sheets("Sheet2").Range("D" & y).Value = clinicAddress
y = y + 1
Next
objIE.Quit
End Sub
When I run this, I get the error 91 "Object variable or With block variable not set" on the clicking line:
objIE.Document.getElementsByClassName("physician-toggle active")(0).Click
You don't need to loop all pages. You can use the browser to get to that page and click on Doctors if required. After that, grab the number of results and then mimic the xhr request the page makes for listings - which is returned as json. Alter the query string the page makes i.e. the parameter for limit to get all listings. Use a json parser (I use jsonconverter - instructions in the code for installation) to parse out your info.
There is a proper page load wait and a couple of loops to ensure elements are present. These should really be timed loops. See loop format here.
I add an additional test to ensure you do not attempt to click Doctors when it is not required to do so.
Not all listings has all info hence the On Error Resume Next paired with On Error GoTo 0. Looks like you may be able to build a dictionary to fill in some of the blank values based on existing paired values (or using ids present in json object).
I store all results in an array and write out in one go.
Option Explicit
'VBE > Tools > References:
' Microsoft Internet Controls
' Microsoft Scripting Runtime
'Download and add to standard module called jsonconverter from https://github.com/VBA-tools/VBA-JSON/blob/master/JsonConverter.bas
Public Sub GetListings()
Dim ie As InternetExplorer, s As String, json As Object, newUrl As String
Set ie = New InternetExplorer
With ie
.Visible = True
.Navigate2 "https://albertafindadoctor.ca/find-a-doc/directory"
While .Busy Or .readyState < 4: DoEvents: Wend
With .document.querySelector("[data-cp-option=physician]")
If Not .className = "physician-toggle active" Then .Click
End With
Dim resultsInfo() As String, numResults As Long, ele As Object
Do
On Error Resume Next
Set ele = .document.querySelector(".paginator")
On Error GoTo 0
Loop While ele Is Nothing
Do
Loop While .document.querySelector(".paginator").innerText = vbNullString
resultsInfo = Split(Trim$(.document.querySelector(".paginator").innerText), "of ")
.Quit
End With
numResults = resultsInfo(UBound(resultsInfo))
newUrl = "https://albertafindadoctor.ca/search/directory/physicians?page=1&limit=" & numResults & "&with[]=pcn&with[]=clinics&with[]=languages&with[]=specialties"
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", newUrl, False
.send
Set json = JsonConverter.ParseJson(.responseText)("items")
End With
Dim row As Object, results(), r As Long, headers(), ws As Worksheet, key As Variant
headers = Array("clinicCategory", "clinicName", "doctorName", "clinicAddress")
Set ws = ThisWorkbook.Worksheets("Sheet1")
ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
For Each row In json
r = r + 1
On Error Resume Next
For Each key In row.keys
Select Case key
Case "clinical_name"
results(r, 3) = row(key)
Case "pcn"
results(r, 1) = row(key)("name")
Case "clinics"
results(r, 2) = row(key)(1)("name")
results(r, 4) = Join$(Array(row(key)(1)("street_address"), row(key)(1)("city"), row(key)(1)("province"), row(key)(1)("postal_code")), ", ")
End Select
Next
On Error GoTo 0
Next
With ws
.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Sample output:
Reading:
querySelector
json
css selectors
arrays and arrays2

Web Scraping ETFs Daily Data VBA

I'm trying to web scrape some daily info of differents ETFs. I found that https://www.marketwatch.com/ have a accurate info.
The most relevant info is the open Price, outstanding shares, NAV, total assets of the ETF.
Here is the link for IVV US Equity: https://www.marketwatch.com/investing/fund/ivv
I have web scraped with VBA before but the HTML of the pages I had used are different, I don't know if this is because some values of the ETFs (such as Price and Taded Volume) change constantly.
The idea is to create a code to extract relevant info and create a data base to analyze Macroeconomics factor using the ETFs as market indicators of flows between countries, regions, etc...
Mi first approach would be with VBA but after I get more into the data I would like to try with Python (after I get more conffident with it) to automate the webscraping process on a daily basis.
I am open to any suggestion or any other website that could be useful (I have tried with Yahoo Finance and Morningstar and I get the same problema with the HTML code).
This is my poor code:
Sub Get_Data()
Dim ticker As String, enlace As String
ticker = ThisWorkbook.Worksheets("ETFs").Cells(2, 2).Value 'IVV
'link = "https://www.morningstar.com/etfs/arcx/" & ticker & "/quote.html"
'link = "https://finance.yahoo.com/quote/" & ticker & "?p=" & ticker
link = "https://www.marketwatch.com/investing/fund/" & ticker
Application.ScreenUpdating = False
Dim x As Integer
x = ThisWorkbook.Worksheets("ETFs").Cells(Rows.Count, 1).End(xlUp).Row
'Dim i As Integer
'For i = 2 To x
Dim total_net_assets As Variant, open_price As Variant, NAV As Variant, shares_out
Set ie = CreateObject("InternetExplorer.application")
With ie
.Visible = False
.navigate link
While .Busy Or .readyState < 4: DoEvents: Wend
Do
DoEvents
On Error Resume Next
' Here is where I get the problem of not knowing how to reference the values I need because the class name appears repeatedly
total_net_assets = .document.getElementsByClassName("").Value
open_price = .document.getElementByClassName("price").Value
NAV = .document.getElementByClassName("").Value
shares_out = .document.getElementByClassName("kv__value kv__primary ").Value
On Error GoTo 0
Loop
End With
ThisWorkbook.Worksheets("ETFs").Cells(2, 13).Value = total_net_assets
ThisWorkbook.Worksheets("ETFs").Cells(2, 14).Value = NAV
ThisWorkbook.Worksheets("ETFs").Cells(2, 15).Value = open_price
ThisWorkbook.Worksheets("ETFs").Cells(2, 16).Value = shares_out
ie.Quit
'Next i
Application.ScreenUpdating = True
End Sub
Access method:
I use XMLHTTP requests as much faster than opening IE.
Code notes:
The following reads in fund short codes from Sheet1 column A, starting in A2, into an array. You can easily extend this adding more funds into column A.
This array is looped issuing XMLHTTP requests by concatenating the fund code into the BASE_URL variable.
I use a class, clsHTTP, to hold the XMLHTTP object to be efficient - no need to keep creating and destroying the object.
I provide this class with two methods. One to retrieve the target page innerHTML (GetString), and the other to extract the required info if available (GetInfo). I use a dictionary to test if the searched for labels are present. If present I grab the associated value. If not, I have a placeholder vbNullString in the dictionary.
I add each scraped result into a collection called results. At the end I loop this writing out to the sheet. By keeping most of the work in memory this provides for much faster scraping.
Retrieving info from HTML:
The labels e.g. Open, and values come in pairs.
You can generate a nodeList (think collection as with getElementsByClassName) by using querySelectorAll method to apply a class CSS selector to gather the label elements by their class name kv__label. The "." is the class selector.
Set labels = .querySelectorAll(".kv__label") '<== nodeList of labels
You do the same to get the associated values:
Set values = .querySelectorAll(".kv__value.kv__primary") '<== nodeList of associated values. Same length as labels nodeList so can use same index to retrieve associated label/value pairs from each nodeList.
You loop the labels using the dictionary in the clsHTTP method .GetInfo to see if you searched for labels are present, if they are, the associated value is retrieved from values by using the same index as where the label was found in the nodeList labels, and the dictionary vbNullString value for that label is updated with the actual retrieved value, else it is left as vbNullString.
Sample results:
VBA:
Class module clsHTTP:
Option Explicit
Private http As Object
Private Sub Class_Initialize()
Set http = CreateObject("MSXML2.XMLHTTP")
End Sub
Public Function GetString(ByVal url As String) As String
Dim sResponse As String
With http
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
GetString = sResponse
End With
End Function
Public Function GetInfo(ByVal html As HTMLDocument) As Object
Dim dict As Object, i As Long
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "Open", vbNullString
dict.Add "Shares Outstanding", vbNullString
dict.Add "Total Net Assets", vbNullString
dict.Add "NAV", vbNullString
Dim values As Object, labels As Object
With html
Set values = .querySelectorAll(".kv__value.kv__primary")
Set labels = .querySelectorAll(".kv__label")
For i = 0 To labels.Length - 1
If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
Next
End With
Set GetInfo = dict
End Function
Standard module 1:
Option Explicit
Public Sub GetFundInfo()
Dim sResponse As String, html As HTMLDocument, http As clsHTTP, i As Long
Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
Application.ScreenUpdating = False
headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
Set results = New Collection
Set http = New clsHTTP
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
For i = LBound(funds) To UBound(funds)
If Not IsEmpty(funds(i)) Then
url = BASE_URL & funds(i)
html.body.innerHTML = http.GetString(url)
results.Add http.GetInfo(html).Items
End If
Next
If results.Count > 0 Then
Dim item As Variant, r As Long, c As Long
r = 2: c = 2
With ws
.Cells(1, c).Resize(1, UBound(headers) + 1) = headers
For Each item In results
.Cells(r, c).Resize(1, UBound(item) + 1) = item
r = r + 1
Next
End With
End If
Application.ScreenUpdating = True
End Sub
Set-up:
Without using a class:
Option Explicit
Public Sub GetFundInfo()
Dim sResponse As String, html As HTMLDocument, i As Long
Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
Application.ScreenUpdating = False
headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
Set results = New Collection
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
For i = LBound(funds) To UBound(funds)
If Not IsEmpty(funds(i)) Then
url = BASE_URL & funds(i)
html.body.innerHTML = GetString(url)
results.Add GetInfo(html).Items
End If
Next
If results.Count > 0 Then
Dim item As Variant, r As Long, c As Long
r = 2: c = 2
With ws
.Cells(1, c).Resize(1, UBound(headers) + 1) = headers
For Each item In results
.Cells(r, c).Resize(1, UBound(item) + 1) = item
r = r + 1
Next
End With
End If
Application.ScreenUpdating = True
End Sub
Public Function GetString(ByVal url As String) As String
Dim http As Object
Set http = CreateObject("MSXML2.XMLHTTP")
Dim sResponse As String
With http
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
GetString = sResponse
End With
End Function
Public Function GetInfo(ByVal html As HTMLDocument) As Object
Dim dict As Object, i As Long
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "Open", vbNullString
dict.Add "Shares Outstanding", vbNullString
dict.Add "Total Net Assets", vbNullString
dict.Add "NAV", vbNullString
Dim values As Object, labels As Object
With html
Set values = .querySelectorAll(".kv__value.kv__primary")
Set labels = .querySelectorAll(".kv__label")
For i = 0 To labels.Length - 1
If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
Next
End With
Set GetInfo = dict
End Function
Okay, so you will need to create two loops. You can just keep reusing the elem0, elem1, and elemColl(1) variables for each price point you need - just make sure to reset bFoundIt to False for each new iteration so you do not exit the For Loops early.
For your total_net_assets var, you will first loop the class of kv__item. You will then need to loop each class collection of kv__label within the kv__item's elements and stop when you match the innerText: Total Net Assets. Once you match this, you will use the first coll obj elem0 to get the kv__value kv__primary class name for it.
Dim IE As Object, elem0 As Object, elem1 As Object, i As Long, bFoundIt As Boolean
Set IE = CreateObject("InternetExplorer.application")
With IE
.Visible = False
.navigate link
While .Busy Or .readyState < 4: DoEvents: Wend
DoEvents
bFoundIt = False
For Each elem0 In .document.getElementsByClassName("kv__item")
For Each elem1 In elem0.getElementsByClassName("kv__label")
If elem1.innerText = "Total Net Assets" Then
bFoundIt = True
total_net_assets = elem0.getElementsByClassName("kv__value kv__primary ")(0).innerText
Exit For
End If
Next elem1
If bFoundIt Then Exit For
Next elem0

How to use appIE.Document.Body.innerHTML

So I'm trying to retrieve the latitude and longitude of a given postal code, and am trying to use VBA to place this into an excel worksheet. My code is as follows:
Private appIE As Object
Function GeoCode(sLocationData As String) As String
'//Dont want to open and close all day long - make once use many
If appIE Is Nothing Then
CreateIEApp
'// Creates a new IE App
'// if = nothing now then there was an error
If appIE Is Nothing Then
GeoCode = "Sorry could not launch IE"
Exit Function
Else
'// do nothing
End If
Else
'// do nothing
End If
'//clearing up input data
'sLocationData = Replace(sLocationData, ",", " ")
sLocationData = Replace(sLocationData, " ", "+")
sLocationData = Trim(sLocationData)
'//Build URL for Query
sLocationData = "http://maps.google.com/maps/geo?q=%20_" & sLocationData
'// go to the google web service and get the raw CSV data
'// CAUSES PROBLEM AS SPECIFIED BELOW
appIE.Navigate sLocationData
Do While appIE.Busy
Application.StatusBar = "Contacting Google Maps API..."
Loop
Application.StatusBar = False
On Error Resume Next
'// Parsing
GeoCode = appIE.Document.Body.innerHTML
GeoCode = Mid(GeoCode, InStr(GeoCode, ",") + 1, InStr(GeoCode, "/") - InStr(GeoCode, ",") - 2)
appIE = Nothing
End Function
The Google Maps API then returns a JSON formatted value, as per this link:
http://maps.google.com/maps/geo?q=%20_400012
I then attempt to retrieve this value using
appIE.Document.Body.innerHTML,
and parsing that value for the data I want. However, the moment the code hits appIE.Navigate sLocationData,
I'm prompted to save a file called "geo". When saved and opened as a .txt file, I get the exact same JSON formatted value, but I need the values within my worksheet itself.
Is there a way to do this?
Thanks in advance!
That link didn't work for me in Firefox - response 610. If I remove the space and the underscore, it works. I don't know why IE wants to download, probably some setting that tells it to always download JSON rather than render it. In any case, consider using MSXML's http request rather than automating IE.
Set a reference to Microsoft XML, v6.0 or similar (VBE - Tools - References).
Function GeoCode(sLocData As String) As String
Dim xHttp As MSXML2.XMLHTTP
Dim sResponse As String
Dim lStart As Long, lEnd As Long
Const sURL As String = "http://maps.google.com/maps/geo?q="
Const sCOOR As String = "coordinates"": " 'substring that we'll look for later
'send the http request
Set xHttp = New MSXML2.XMLHTTP
xHttp.Open "GET", sURL & sLocData
xHttp.send
'wait until it's done
Do
DoEvents
Loop Until xHttp.readyState = 4
'get the returned data
sResponse = xHttp.responseText
'find the starting and ending points of the substring
lStart = InStr(1, sResponse, sCOOR)
lEnd = InStr(lStart, sResponse, "]")
GeoCode = Mid$(sResponse, lStart + Len(sCOOR), lEnd - lStart - Len(sCOOR) + 1)
End Function
Sub Test()
Dim sTest As String
sTest = GeoCode("400012")
Debug.Assert sTest = "[ 103.9041520, 1.3222160, 0 ]"
End Sub

Resources