Scrape economic data from investing website - excel

I am working on a code to get data from : https://www.investing.com/economic-calendar/core-durable-goods-orders-59
I already have code that pulls historical instrument data via an HTTP request (shown below), but I would like to change it to work for the economic data at the link above. Is there any way I can get the same for the economic indicators?
code below:
Option Explicit
Sub Export_Table()
'Html Objects---------------------------------------'
Dim htmlDoc As MSHTML.HTMLDocument
Dim htmlBody As MSHTML.htmlBody
Dim ieTable As MSHTML.HTMLTable
Dim Element As MSHTML.IHTMLElement
'Workbooks, Worksheets, Ranges, LastRow, Incrementers ----------------'
Dim wb As Workbook
Dim Table As Worksheet
Dim i As Long
Set wb = ThisWorkbook
Set Table = wb.Worksheets("Sheet1")
'-------------------------------------------'
Dim xmlHttpRequest As New MSXML2.XMLHTTP60 '
'-------------------------------------------'
i = 2
'Web Request --------------------------------------------------------------------------'
With xmlHttpRequest
.Open "POST", "https://www.investing.com/instruments/HistoricalDataAjax", False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
.setRequestHeader "X-Requested-With", "XMLHttpRequest"
.send "curr_id=951681&smlID=1695217&header=CLNX+Historical+Data&st_date=01%2F01%2F2017&end_date=03%2F01%2F2019&interval_sec=Monthly&sort_col=date&sort_ord=DESC&action=historical_data"
If .Status = 200 Then
Set htmlDoc = CreateHTMLDoc
Set htmlBody = htmlDoc.body
htmlBody.innerHTML = xmlHttpRequest.responseText
Set ieTable = htmlDoc.getElementById("curr_table")
For Each Element In ieTable.getElementsByTagName("tr")
Table.Cells(i, 1) = Element.Children(0).innerText
Table.Cells(i, 2) = Element.Children(1).innerText
Table.Cells(i, 3) = Element.Children(2).innerText
Table.Cells(i, 4) = Element.Children(3).innerText
Table.Cells(i, 5) = Element.Children(4).innerText
Table.Cells(i, 6) = Element.Children(5).innerText
Table.Cells(i, 7) = Element.Children(6).innerText
i = i + 1
DoEvents: Next Element
End If
End With
Set xmlHttpRequest = Nothing
Set htmlDoc = Nothing
Set htmlBody = Nothing
Set ieTable = Nothing
Set Element = Nothing
End Sub
Public Function CreateHTMLDoc() As MSHTML.HTMLDocument
Set CreateHTMLDoc = CreateObject("htmlfile")
End Function

I have used the Excel tool Power Query (also called Get & Transform Data) for this very thing. I don't like using VBA for this type of task.
To make it work:
In Excel, go to Data > Get Data > From Other Sources > From Web.
Enter the URL.
Wait for the webpage to load and then pick the table that you want.
This website took a while to load, but it did work for me.
Choose "Load", which puts the data directly on the sheet, or "Transform Data" to manipulate the data in Power Query first. Power Query has many options such as splitting columns, filtering data, calculated columns and so on.

I would avoid the overhead of setting up a permanent connection and simply continue using XHR. With Data > From Web, you cannot grab more rows than are present on the initial landing page. If, however, you go with XHR, you can issue POST requests to get more data. The code below uses a loop to retrieve additional results beyond those immediately visible on the page.
When you press the Show more link there is a POST request for an additional 6 rows which uses the latest date from the current set of results as part of the POST body. The response returned is JSON. Rather than bringing in a JSON parser, given the standard nature of the JSON, and that I am already using regex to clean the date format in column 1 to put in the POST body, I use two simple regexes to extract the html for the next results table from the response, and to check whether there are more results.
The format of the JSON is:
{
"historyRows": "<tr>…..</tr>",
"hasMoreHistory": "1"
}
Or
{
"historyRows": "<tr>…..</tr>",
"hasMoreHistory": false
}
So, I do some cleaning of the extracted html in order to not confuse the html parser within MSHTML. Furthermore, I add in an id to identify the table I have constructed, so I can continue to use an id css selector (#) list within my UpdateDateResults function.
I initially oversize an array to store each retrieved table which I update ByRef. I loop requesting more results until either there are no more results, there is an error parsing the maximum date from the last retrieved table column 1, or until my specified earliest date for data retrieval falls within the date range of the latest returned table.
Finally, I write the results array out to the sheet in one go.
N.B. You can target the table by its id. It looks like the number at the end of the id could be the same as for the goods url, lending itself to generalizing the code below to work for other goods.
VBA:
Option Explicit
Public Sub GetInvestingInfo()
'tools > references > Microsoft HTML Object Library
Dim html As MSHTML.HTMLDocument, xhr As Object
Const STARTDATE As Date = "2019-11-25" 'Adjust as required, e.g. DateAdd("yyyy", -2, Date) for 2 years back.
'The batch that spans the start date may include some earlier months, but no additional request is issued after it.
Set xhr = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", "https://www.investing.com/economic-calendar/core-durable-goods-orders-59", False
.setRequestHeader "User-Agent", "Safari/537.36"
.send
html.body.innerHTML = .responseText
End With
Dim firstTable As Boolean, r As Long, results() As Variant
ReDim results(1 To 100000, 1 To 5)
'process initial table and update results, get cleaned date needed for request for more results
firstTable = True
Dim latestDate As String
UpdateDateResults latestDate, results, firstTable, r, html
Dim re As Object, maxDate As Date, hasMoreHistory As Boolean, s As String
Set re = CreateObject("VBScript.RegExp")
With re
.Global = True
.MultiLine = False
End With
maxDate = cleanedDate(latestDate, re)
hasMoreHistory = True
Dim errorDate As Date
errorDate = DateAdd("d", 1, Date)
Do While maxDate >= STARTDATE And maxDate < errorDate 'break loop using pre-defined earliest date, error with date conversion, or when no more rows found
Application.Wait (Now + TimeSerial(0, 0, 1)) 'Pause
s = GetMoreRows(xhr, Format$(maxDate, "YYYY-MM-DD")) 'make a POST request for more data
re.Pattern = "hasMoreHistory"":(""?.*?""?)}" 'Check if there are more rows still available. "1" for yes, false for no
hasMoreHistory = (re.Execute(s)(0).submatches(0) <> False)
If Not hasMoreHistory Then Exit Do
re.Pattern = "historyRows"":""(.*)"","
html.body.innerHTML = "<table id=""me"">" & Replace$(re.Execute(s)(0).submatches(0), "\/", "/") & "</table>" 'fix html and feed into html variable
UpdateDateResults latestDate, results, firstTable, r, html
maxDate = cleanedDate(latestDate, re) 'convert value retrieved from last row in date column of table to an actual date
Loop
With ActiveSheet
.Cells.ClearContents
.Cells(1, 1).Resize(r, 5) = results 'No need to resize results as all cells are cleared before the write out
End With
End Sub
Public Sub UpdateDateResults(ByRef latestDate As String, ByRef results() As Variant, ByRef firstTable As Boolean, ByRef r As Long, ByVal html As MSHTML.HTMLDocument)
Dim table As MSHTML.HTMLTable 'return latest date from function
Set table = html.querySelector("#eventHistoryTable59, #me")
latestDate = table.Rows(table.Rows.Length - 1).Children(0).innerText
Dim i As Long, n As Long, j As Long
n = IIf(firstTable, 0, 1)
For i = n To table.Rows.Length - 1
r = r + 1
For j = 0 To table.Rows(i).Children.Length - 2
results(r, j + 1) = table.Rows(i).Children(j).innerText
Next
Next
firstTable = False
End Sub
Public Function cleanedDate(ByVal dirtyString As String, ByVal re As Object) As Date
re.Pattern = "(^[A-Z][a-z]{2}).*(\d{2}),.(\d{4})(.*)"
On Error GoTo errhand:
If re.test(dirtyString) Then
cleanedDate = CDate(re.Replace(dirtyString, "$2" & Chr$(32) & "$1" & Chr$(32) & "$3"))
Exit Function
End If
errhand:
cleanedDate = DateAdd("d", 1, Date)
End Function
Public Function GetMoreRows(ByVal xhr As Object, ByVal dateStamp As String) As String
With xhr
.Open "POST", "https://www.investing.com/economic-calendar/more-history", False
.setRequestHeader "User-Agent", "Safari/537.36"
.setRequestHeader "x-requested-with", "XMLHttpRequest"
.setRequestHeader "content-type", "application/x-www-form-urlencoded"
.send "eventID=430865&event_attr_ID=59&event_timestamp=" & dateStamp & "+" & Application.WorksheetFunction.EncodeURL("12:30:00") & "&is_speech=0"
GetMoreRows = .responseText
End With
End Function
Regexes (without the double " escaping for VBA):
hasMoreHistory":("?.*?"?)}
historyRows":"(.*)",
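As a quick illustration of the date cleaning, assuming the first column of the history table holds values such as "Nov 27, 2019 (Oct)" (the exact suffix is an assumption), cleanedDate behaves as follows when called from the same module:

Sub TestCleanedDate()
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    'The regex rearranges "Nov 27, 2019 (Oct)" into "27 Nov 2019" before CDate
    Debug.Print cleanedDate("Nov 27, 2019 (Oct)", re)   'e.g. 27/11/2019 (display depends on locale)
    'Anything that fails to match or convert falls back to tomorrow's date, which ends the main loop
    Debug.Print cleanedDate("no date here", re)
End Sub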

Related

Print each item in a JSON Object as separate row in excel file using VBA

I'm trying to read the json from a URL and paste the json object items as separate rows in an Excel sheet.
So far I am only able to fetch the entire json object into Excel, in the single cell A1.
I tried to iterate through the json object and print the items in separate rows.
But I am getting run-time error 424 (Object required).
Need your help on the same.
Sample Data : {"data":{"id":3,"name":"true red","year":2002,"color":"#BF1932","pantone_value":"19-1664"},"support":{"url":"https://reqres.in/#support-heading","text":"To keep ReqRes free, contributions towards server costs are appreciated!"}}
Private Sub HTML_VBA_Extract_Data_From_Website_To_Excel()
Dim oXMLHTTP As Object
Dim sPageHTML As String
Dim sURL As String
'Change the URL before executing the code. URL to Extract data from.
sURL = "https://reqres.in/api/products/3"
'Extract data from website to Excel using VBA
Set oXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
oXMLHTTP.Open "GET", sURL, False
oXMLHTTP.send
sPageHTML = oXMLHTTP.responseText
Dim jsonObject As Object
Set jsonObject = JsonConverter.ParseJson(sPageHTML)
Dim i As Integer
Dim Item As Variant
i = 2
For Each Item In jsonObject.Keys
ThisWorkbook.Sheets(1).Cells(i, 1).Value = Item
i = i + 1
Next
MsgBox "XMLHTML Fetch Completed"
End Sub
The code below outputs:
data
support
--------
id 3
name true red
year 2002
color #BF1932
pantone_value 19-1664
to the Immediate window, as expected.
Private Sub JsonTester()
Dim oXMLHTTP As Object
Dim sURL As String
Dim jsonObject As Object
Dim Item As Variant, data As Object
sURL = "https://reqres.in/api/products/3"
Set oXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
oXMLHTTP.Open "GET", sURL, False
oXMLHTTP.send
Set jsonObject = JsonConverter.ParseJson(oXMLHTTP.responseText)
For Each Item In jsonObject.keys
Debug.Print Item
Next
Debug.Print "--------"
Set data = jsonObject("data")
For Each Item In data.keys
Debug.Print Item, data(Item)
Next
End Sub
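If the aim is to land the nested "data" items on the worksheet rather than in the Immediate window, a minimal sketch along the same lines (assuming JsonConverter is installed as above, and writing keys to column A and values to column B of the first sheet) could be:

Private Sub JsonToRows()
    Dim oXMLHTTP As Object, jsonObject As Object, data As Object
    Dim key As Variant, r As Long
    Set oXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
    oXMLHTTP.Open "GET", "https://reqres.in/api/products/3", False
    oXMLHTTP.send
    Set jsonObject = JsonConverter.ParseJson(oXMLHTTP.responseText)
    Set data = jsonObject("data")                             'nested dictionary holding the row items
    r = 2
    For Each key In data.keys
        ThisWorkbook.Sheets(1).Cells(r, 1).Value = key        'e.g. "name"
        ThisWorkbook.Sheets(1).Cells(r, 2).Value = data(key)  'e.g. "true red"
        r = r + 1
    Next
End Sub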

URL Issue retrieving data quotes in Yahoo finance

The URL from Yahoo is not working when I try to retrieve quotes for a particular stock. There are several discussions about it; however, it seems nothing covers doing this with a VBA macro.
Sub Get_Data()
Dim URL As String
Dim Ticker As String
Dim http As New WinHttpRequest
Dim sCotes As String
Dim Lignes
Dim Valeurs
Dim i As Long
Dim j As Long
Dim sLigne As String
Dim sValeur As String
Ticker = Range("Ticker")
URL = "https://query1.finance.yahoo.com/v7/finance/download/TECK?period1=1540456339&period2=1571992339&interval=1d&events=history&crumb=kjOZLFv6ch2"
http.Send
sCotes = http.ResponseText
MsgBox sCotes
Lignes = Split(sCotes, Chr(10))
For i = 1 To UBound(Lignes) 'until the end of the Lignes variable
sLigne = Lignes(i)
Valeurs = Split(sLigne, ",")
For j = 0 To UBound(Valeurs) - 1
Select Case j
Case 0
sValeur = DateSerial(CLng(Left(Valeurs(0), 4)), CLng(Mid(Valeurs(0), 6, 2)), CLng(Right(Valeurs(0), 2)))
Case 5
sValeur = CLng(Valeurs(5))
Case Else
sValeur = CDbl(Replace(Valeurs(j), ".", ","))
End Select
Range("A1").Offset(i, j) = sValeur
Application.StatusBar = Format(Cells(i, 1), "Short Date")
Next
Next
Application.StatusBar = False
End Sub
Execution error at the step Http.send : "This method cannot be called until the Open method has been called"
You need to call the Open method before attempting to Send, and GET is perfectly fine. However, a few things....
There is an easier way. The headers worth adding are the User-Agent and one to mitigate being served cached results. The following shows you how to get a json response from the server for a specified time period and write it to Excel. Note: you need to concatenate the ticker into the url. You should probably also test the response code from the server to ensure the request was successful.
I use jsonconverter.bas as the json parser to handle the response. Download the raw code from here and add it to a standard module called JsonConverter. You then need to go to VBE > Tools > References and add a reference to Microsoft Scripting Runtime. Remove the top Attribute line from the copied code.
The values for startDate and endDate need to be passed as unix timestamps. @TimWilliams wrote a nice function, toUnix, for converting Date to Unix here, which I use. I have added my own function to handle the conversion in the opposite direction.
This method avoids using any session-based identifiers, so it avoids your issue with the invalid cookie crumb.
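Once JsonConverter and the Scripting Runtime reference are in place, a quick sanity check of the parser (illustrative only, not part of the Yahoo request) might look like:

Sub TestJsonConverter()
    Dim json As Object
    'Parse a trivial object and read values back
    Set json = JsonConverter.ParseJson("{""ticker"":""TECK"",""interval"":""1d""}")
    Debug.Print json("ticker"), json("interval")   'TECK   1d
End Sub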
VBA:
Option Explicit
Public Sub GetYahooHistoricData()
Dim ticker As String, ws As Worksheet, url As String, s As String
Dim startDate As Long, endDate As Long
Set ws = ThisWorkbook.Worksheets("Sheet1")
ticker = ws.Range("ticker") 'Range A1. Above write out range
endDate = toUnix("2019-10-27")
startDate = toUnix("2018-10-25")
url = "https://query1.finance.yahoo.com/v8/finance/chart/" & ticker & "?region=US&lang=en-US&includePrePost=false&interval=1d&period1=" & startDate & "&period2=" & endDate & "&corsDomain=finance.yahoo.com&.tsrc=finance"
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", url, False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
s = .responseText
End With
Dim json As Object
Set json = JsonConverter.ParseJson(s)("chart")("result")
Dim dates As Object, results(), rows As Object, adjClose As Object, r As Long, headers()
headers = Array("date", "close", "volume", "open", "high", "low", "adjclose")
Set dates = json(1)("timestamp")
ReDim results(1 To dates.Count, 1 To UBound(headers) + 1)
Set rows = json(1)("indicators")("quote")(1)
Set adjClose = json(1)("indicators")("adjclose")(1)("adjclose")
For r = 1 To dates.Count
results(r, 1) = GetDate(dates(r))
results(r, 2) = rows("close")(r)
results(r, 3) = rows("volume")(r)
results(r, 4) = rows("open")(r)
results(r, 5) = rows("high")(r)
results(r, 6) = rows("low")(r)
results(r, 7) = adjClose(r)
Next
With ws
.Cells(3, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(4, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Public Function GetDate(ByVal t As Variant) As String
GetDate = Format$(t / 86400 + DateValue("1970-01-01"), "yyyy-mm-dd")
End Function
Public Function toUnix(ByVal dt As Variant) As Long
toUnix = DateDiff("s", "1/1/1970", dt)
End Function
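As a quick check, the two date helpers should round-trip (a minimal sketch; the printed date format depends on your locale settings):

Sub TestDateHelpers()
    Dim ts As Long
    ts = toUnix("2019-10-27")      'seconds since 1970-01-01
    Debug.Print ts                 '1572134400
    Debug.Print GetDate(ts)        '2019-10-27
End Sub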
Try replacing this code
URL = "https://query1.finance.yahoo.com/v7/finance/download/TECK?period1=1540456339&period2=1571992339&interval=1d&events=history&crumb=kjOZLFv6ch2"
http.Send
with this code:
Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
URL = "https://query1.finance.yahoo.com/v7/finance/download/TECK?period1=1540456339&period2=1571992339&interval=1d&events=history&crumb=kjOZLFv6ch2"
http.open "POST", URL, False
http.Send
The error is pretty clear: you need to call the open method before the Send method. Also this would be a POST request. You also may need to put these two lines after the open method:
http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
http.setRequestHeader "Content-Length", 0
The question is about a 99% duplicate of the one from here - How can I send an HTTP POST request to a server from Excel using VBA?. Anyway, the mistake is obvious: the .Send method is called on the WinHttpRequest object before Open, so a completely empty request is sent.
To make the code work, copy the example from the duplicated question and print the http.ResponseText:
Sub TestMe()
Dim http As Object
Dim url As String
Set http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
url = "https://query1.finance.yahoo.com/v7/finance/download/TECK?period1=1540456339&period2=1571992339&interval=1d&events=history&crumb=kjOZLFv6ch2"
http.Open "POST", url, False
http.Send
MsgBox http.responsetext
End Sub

Is there a way to slow down a Web Scraper so it will pick up the code?

I wrote a macro to go to Weather Underground (WU) to get historical data and, for the most part, it works. However, I believe the macro is running too fast to pick up the data from the website.
https://www.wunderground.com/history/daily/us/tx/el-paso/KELP/date/2017-1-3
is the website, and the table I want to get has the class tablesaw-sortable.
I have tried the following: DoEvents and Application.Wait (Now + TimeValue("00:00:01")) to try to slow down the process.
Sub BrowseToWU()
Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As MSHTML.HTMLDocument
Dim RowAddress As Integer
Dim WebAddress As String
Dim DateSheet As Date
Dim WkDay As Integer
Dim DateSheetName As String
'Application.ScreenUpdating = False
'Application.StatusBar = True
RowAddress = 2
IE.Visible = True
Do Until RowAddress = 60
WebAddress = Range("A" & RowAddress)
DateSheet = Right(WebAddress, 8)
DateSheetName = Right(WebAddress, 8)
WkDay = Weekday(DateSheet, vbSunday)
If WkDay < 3 Then
RowAddress = RowAddress + 1
ElseIf WkDay > 6 Then
RowAddress = RowAddress + 1
Else
IE.Navigate WebAddress
Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop
Set HTMLDoc = IE.Document
DoEvents
Application.Wait (Now + TimeValue("00:00:05"))
DoEvents
ProcessHTMLPage HTMLDoc
DateSheet = Right(WebAddress, 8)
DoEvents
Application.Wait (Now + TimeValue("00:00:01"))
ActiveSheet.Name = DateSheetName
DoEvents
RowAddress = RowAddress + 1
'IE.Quit
Worksheets("Sheet1").Activate
End If
Loop
End Sub
Option Explicit
Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)
Dim HTMLTable As MSHTML.IHTMLElement
Dim HTMLTables As MSHTML.IHTMLElementCollection
Dim HTMLRow As MSHTML.IHTMLElement
Dim HTMLCell As MSHTML.IHTMLElement
Dim RowNum As Long, ColNum As Integer
'Dim IE As New SHDocVw.InternetExplorer
'Dim Ws As Worksheet
Set HTMLTables = HTMLPage.getElementsByClassName("tablesaw-sortable")
'DoEvents
For Each HTMLTable In HTMLTables
Worksheets.Add
DoEvents
Range("A1").Value = HTMLTable.className
Range("B1").Value = Now
RowNum = 2
For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
'Debug.Print vbTab & HTMLRow.innerText
ColNum = 1
For Each HTMLCell In HTMLRow.Children
Cells(RowNum, ColNum) = HTMLCell.innerText
ColNum = ColNum + 1
Next HTMLCell
RowNum = RowNum + 1
Next HTMLRow
Next HTMLTable
DoEvents
'IE.Quit
End Sub
The macro is supposed to run through sheet1 picking up the web address to the historical data if it satisfies the criteria of being a certain day of the week.
IE will open and then it will kick over to the next module that will take in the data.
A new worksheet is created and the data pasted into the new worksheet.
The worksheet is renamed to the date of the data.
The web address sheet is activated again and the process starts over again.
The problem is that the data isn't taken from the website, so the For loop finds nothing to process; the web-address sheet then gets renamed and an error occurs.
One way around this is to call the API that the page is using to get that info.
The API returns json which you can parse with a json parser. I use jsonconverter.bas. After installing the code from that link in a standard module called JsonConverter, go to VBE > Tools > References > Add a reference to Microsoft Scripting Runtime.
Finding the API:
If you press F12 to open developer tools and go to the Network tab and then press F5 to refresh any url of interest you will see the recorded web traffic. You can find the API call there.
See my answer here on how to search the network traffic using a specific observation value you expect to see in the response - this will filter the list of network traffic to those items containing the value of interest. Be sensible in selecting the value - you want something unlikely to occur elsewhere. You can also filter the network traffic to XHR only.
The API response:
The API returns json. More specifically, it returns a dictionary containing 2 keys. The second key, "observations", can be used to return a collection (denoted by []) of dictionaries (denoted by {}).
Each dictionary represents a row of the table (daily observations). You can loop this collection, and then loop the inner dictionaries, to access the table row values and reconstruct the table by populating an array. Explore example json response here.
Explanation of code:
The code is broken down into a number of helper subs and functions, allocating certain tasks to each, to make the code easier to debug and follow, as well as to better align with Object Oriented Programming principles.
Overall the process is:
Gather urls for Worksheet("Sheet1"). Helper function GetAllUrls.
Process those urls and only retain the dates which correspond with Tue-Thur. These are kept as strings formatted as "yyyymmdd" so they can be passed to the API later. This is handled by helper functions GetOnlyQualifyingUrlsDates and IncludeThisDate. IncludeThisDate performs the check for whether to include; GetOnlyQualifyingUrlsDates handles the looping and formatting of results.
Issue xmlhttp requests by looping over qualifying url dates and concatenating those into the url for the API call, then issuing the request. This is performed by the main sub GetTables.
Sheet creation, for output, is handled by helper function CreateWorksheet. This function calls another helper function, SheetExists, to ensure sheets are only created if they don't already exist, otherwise, the existing sheet by that name is used.
The resultant json response, from step 3, is passed to a helper sub WriteOutResults which accepts the json variable and the output sheet object as arguments. It extracts all the info from the json response; essentially reconstructing the table. It adds the table and headers to the appropriate sheet.
It calls helper function Epoch2Date, which handles the unix timestamp to datetime conversion for the two unix fields in the json object.
TODO:
The API key may be time limited. Add a helper function which returns the current valid key (a hedged sketch of such a helper follows this list).
The API accepts start date and end date parameters in the url construct. It would be far better to issue one request for the entire range if possible, or chunked ranges e.g. months, to reduce the number of requests made. This would also reduce the likelihood of being blocked. It would mean some additional code would need to be written, before writing out results, to ensure only dates of interest are being written to sheets. Though you could write out everything and then simply loop all sheets and delete those that aren't wanted (perfectly doable if we are talking about 365 dates in total). Personally, I would handle the include-date check while constructing the table from a single request (if possible) that has the min and max dates of the entire url list passed as the start and end date parameters. I would then write a single flat table out to one sheet, as this will be much easier for later data analysis.
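For the first TODO item, a hedged sketch of such a helper is shown below. It simply requests one of the history pages and pulls whatever value follows apiKey= out of the page source with a regex; that the key is embedded in the page in this form is my assumption, so verify it against the actual page source before relying on it.

Public Function GetApiKey() As String
    'ASSUMPTION: the history page embeds the currently valid key as "apiKey=..." somewhere in its source
    Dim re As Object, s As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.wunderground.com/history/daily/us/tx/el-paso/KELP/date/2017-1-3", False
        .send
        s = .responseText
    End With
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "apiKey=([0-9a-f]{32})"   'assumes a 32 character hex key like the hard-coded one in GetTables below
    If re.test(s) Then GetApiKey = re.Execute(s)(0).submatches(0)
End Function

If that assumption holds, the hard-coded apiKey value in GetTables could then be replaced with a call to GetApiKey.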
VBA:
Option Explicit
Public Sub GetTables()
'VBE > Tools > References > Microsoft Scripting Runtime
Dim json As Object, qualifyingUrlsDates(), urls(), url As String
Dim ws As Worksheet, wsOutput As Worksheet, i As Long, startDate As String, endDate As String
Set ws = ThisWorkbook.Worksheets("Sheet1")
urls = GetAllUrls(2, ws, "A")
qualifyingUrlsDates = GetOnlyQualifyingUrlsDates(urls)
'API key may be not be valid over time so look at obtaining by prior request
With CreateObject("MSXML2.XMLHTTP") 'issue xmlhttp request for each valid date (this would be better done using start and enddate to specify entire range _
of batches e.g. months within total range to cut down on requests
For i = LBound(qualifyingUrlsDates) To UBound(qualifyingUrlsDates)
startDate = qualifyingUrlsDates(i)
endDate = startDate ' a little verbose but useful for explaining
url = "https://api.weather.com/v1/geocode/31.76/-106.49/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=" & startDate & "&endDate=" & endDate & "&units=e"
.Open "GET", url, False
.send
Set json = JsonConverter.ParseJson(.responseText)("observations")
Set wsOutput = CreateWorksheet(qualifyingUrlsDates(i))
WriteOutResults wsOutput, json
Next
End With
End Sub
Public Sub WriteOutResults(ByVal wsOutput As Worksheet, ByVal json As Object)
'json is a collection of dictionaries. Each dictionary is a time period reading from the day i.e. one row in output
Dim results(), item As Object, headers(), r As Long, c As Long, key As Variant
headers = json.item(1).keys 'get the headers which are the keys of each dictionary
ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
For Each item In json
r = r + 1: c = 0 'increase row in results array to store results for table row
For Each key In item.keys
c = c + 1 'increase column number in results array for writing out results
Select Case key
Case "valid_time_gmt", "expire_time_gmt" 'convert unix timestamp fields to datetime
results(r, c) = Epoch2Date(item(key))
Case Else
results(r, c) = item(key)
End Select
Next
Next
With wsOutput
.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
Public Function GetOnlyQualifyingUrlsDates(ByVal urls As Variant) As Variant
Dim i As Long, output(), counter As Long
ReDim output(1 To UBound(urls))
For i = LBound(urls) To UBound(urls)
If IncludeThisDate(urls(i)) Then 'check if weekday is to be included
counter = counter + 1
output(counter) = Format$(Right$(urls(i), 8), "yyyymmdd") 'if to include then add to output array of urls of interest
End If
Next
ReDim Preserve output(1 To counter)
GetOnlyQualifyingUrlsDates = output
End Function
Public Function IncludeThisDate(ByVal url As String) As Boolean
'tue, wed, thurs are valid
IncludeThisDate = Not IsError(Application.Match(Weekday(Right$(url, 8), vbSunday), Array(3, 4, 5)))
End Function
Public Function SheetExists(ByVal sheetName As String) As Boolean '<== function by #Rory
SheetExists = Evaluate("ISREF('" & sheetName & "'!A1)")
End Function
Public Function GetAllUrls(ByVal startRow As Long, ByVal ws As Worksheet, ByVal columnName As String) As Variant
'transpose used based on premise no more than a couple of years of dates
'startRow is start row for urls, ws is sheet where urls found, columnName is string representation of column for urls e.g. "A"
With ws
GetAllUrls = Application.Transpose(.Range(columnName & startRow & ":" & columnName & .Cells(.rows.Count, columnName).End(xlUp).Row).Value)
End With
End Function
Public Function CreateWorksheet(ByVal sheetName As String) As Worksheet
Dim ws As Worksheet
If SheetExists(sheetName) Then
Set ws = ThisWorkbook.Worksheets(sheetName)
'do something.... clear it? Then add new data to it?
Else
Set ws = ThisWorkbook.Worksheets.Add
ws.Name = sheetName
End If
Set CreateWorksheet = ws
End Function
Public Function Epoch2Date(ByVal E As Currency, Optional msFrac) As Date '# Schmidt http://www.vbforums.com/showthread.php?805245-EPOCH-to-Date-and-vice-versa
Const Estart As Double = #1/1/1970#
msFrac = 0
If E > 10000000000# Then E = E * 0.001: msFrac = E - Int(E)
Epoch2Date = Estart + (E - msFrac) / 86400
End Function
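For reference, Epoch2Date converts the unix timestamps returned by the API (seconds, or milliseconds for very large values) into VBA dates. A quick illustrative check, using midnight UTC on 3 Jan 2017 (the date from the question's example url):

Sub TestEpoch2Date()
    Debug.Print Epoch2Date(1483401600)       'seconds      -> 03/01/2017 (display depends on locale)
    Debug.Print Epoch2Date(1483401600000#)   'milliseconds -> same date, millisecond fraction stripped
End Sub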

how to get inner text of html under id?

I am trying to pull the inner text of an element, identified by its id, into an Excel cell.
The page is requested with MSXML2.XMLHTTP60, as in the code below.
Sub getelementbyid()
Dim XMLpage As New MSXML2.XMLHTTP60
Dim hdoc As New MSHTML.HTMLDocument
Dim HBEs As MSHTML.IHTMLElementCollection
Dim HBE As MSHTML.IHTMLElement
Dim ha As String
XMLpage.Open "GET","https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HAL", False
XMLpage.send
hdoc.body.innerHTML = XMLpage.responseText
ha = hdoc.getelementbyid("open").innerText
Range("K11").Value = ha
Debug.Print ha
End Sub
I expect the open price value, but it shows --.
Examine the response text. There is a difference in the way the page is rendered in the browser versus what is returned in the ResponseText.
I put the URL into a browser went into dev tools (F12), found the element, and noted the numeric value inside the HTML element.
Then I dumped the response text we're getting in VBA into a cell and copied the entire cell value into Notepad++. If you do that you'll see the initial value inside the #open element is indeed "--".
The real value appears to be getting written into the HTML via JavaScript, which is common practice. There is a JSON object at the top of the page, presumably injected into the document from the back-end of the website upon your request.
So you have to parse the JSON, not the HTML. I've provided code doing just that. There may be a better way to do it; I feel this code is kind of "hacky", but it gets the job done for your example URL.
Sub getelementbyid()
Dim XMLpage As New MSXML2.XMLHTTP60
Dim hdoc As New MSHTML.HTMLDocument
Dim HBEs As MSHTML.IHTMLElementCollection
Dim HBE As MSHTML.IHTMLElement
Dim ha As String
XMLpage.Open "GET", "https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HAL", False
XMLpage.send
'// sample: ,"open":"681.05",
Dim token As String
token = """open"":"""
Dim startPosition As Integer
startPosition = InStr(1, XMLpage.responseText, token)
Dim endPosition As Integer
endPosition = InStr(startPosition, XMLpage.responseText, ",")
Dim prop As String
prop = Mid(XMLpage.responseText, startPosition, endPosition - startPosition)
prop = Replace(prop, """", vbNullString)
prop = Replace(prop, "open:", vbNullString)
Dim val As Double
val = CDbl(prop)
ha = val
Range("K11").Value = ha
Debug.Print ha
End Sub
Here are two methods: 1) using regex on the returned text (usually frowned upon, but perfectly serviceable here); 2) the traditional route: extract the json string and use a json parser to parse out the value.
The data you want is stored in a json string found both on the webpage and in the xmlhttp response, under the same element:
This means you can treat the html as a string and target just the pattern for the open price using regex as shown below, or feed the xmlhttp response into an html parser, grab the required element, extract its innerText, trim off the whitespace, then pass it to a json parser to extract the open price.
In both methods you want to avoid being served cached results so the following header is an important addition to attempt to mitigate for this:
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
There is no need for additional cell formatting. The full value comes out for both of your tickers.
Regex:
It is present in a json string in the response, so you can easily regex it out of the returned text.
VBA:
Option Explicit
Public Sub GetClosePrice()
Dim ws As Worksheet, re As Object, p As String, r As String
Set ws = ThisWorkbook.Worksheets("Sheet1")
p = """open"":""(.*?)"""
Set re = CreateObject("VBScript.RegExp")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HAL", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
If .Status = 200 Then
r = GetValue(re, .responseText, p)
Else
r = "Failed connection"
End If
End With
ws.Range("K11").Value = r
End Sub
Public Function GetValue(ByVal re As Object, ByVal inputString As String, ByVal pattern As String) As String
With re
.Global = True
.pattern = pattern
If .test(inputString) Then ' returns True if the regex pattern can be matched against the provided string
GetValue = .Execute(inputString)(0).submatches(0)
Else
GetValue = "Not found"
End If
End With
End Function
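To see the helper in isolation, here is a quick test against a made-up string of the same shape as the NSE response (the field names and values in the sample are purely illustrative):

Sub TestGetValue()
    Dim re As Object, sample As String
    Set re = CreateObject("VBScript.RegExp")
    sample = "{""data"":[{""open"":""681.05"",""closePrice"":""679.60""}]}"
    Debug.Print GetValue(re, sample, """open"":""(.*?)""")        '681.05
    Debug.Print GetValue(re, sample, """closePrice"":""(.*?)""")  '679.60
End Sub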
HTML and json parser:
This requires installing the json parser code from jsonconverter.bas in a standard module called JsonConverter, then going to VBE > Tools > References and adding references to Microsoft Scripting Runtime and Microsoft HTML Object Library.
VBA:
Option Explicit
Public Sub GetClosePrice()
Dim ws As Worksheet, re As Object, r As String, json As Object
Set ws = ThisWorkbook.Worksheets("Sheet1")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=MRF", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
If .Status = 200 Then
Dim html As HTMLDocument
Set html = New HTMLDocument
html.body.innerHTML = .responseText
Set json = JsonConverter.ParseJson(Trim$(html.querySelector("#responseDiv").innerText))
r = json("data")(1)("open")
Else
r = "Failed connection"
End If
End With
ws.Range("K11").Value = r
End Sub

Web Scraping ETFs Daily Data VBA

I'm trying to web scrape some daily info on different ETFs. I found that https://www.marketwatch.com/ has accurate info.
The most relevant info is the open price, shares outstanding, NAV and total assets of the ETF.
Here is the link for IVV US Equity: https://www.marketwatch.com/investing/fund/ivv
I have web scraped with VBA before, but the HTML of the pages I had used is different; I don't know if this is because some values of the ETFs (such as Price and Traded Volume) change constantly.
The idea is to create a code to extract relevant info and create a data base to analyze Macroeconomics factor using the ETFs as market indicators of flows between countries, regions, etc...
My first approach would be with VBA, but once I get more into the data I would like to try Python (after I get more confident with it) to automate the web-scraping process on a daily basis.
I am open to any suggestion or any other website that could be useful (I have tried Yahoo Finance and Morningstar and I get the same problem with the HTML code).
This is my poor code:
Sub Get_Data()
Dim ticker As String, enlace As String
ticker = ThisWorkbook.Worksheets("ETFs").Cells(2, 2).Value 'IVV
'link = "https://www.morningstar.com/etfs/arcx/" & ticker & "/quote.html"
'link = "https://finance.yahoo.com/quote/" & ticker & "?p=" & ticker
link = "https://www.marketwatch.com/investing/fund/" & ticker
Application.ScreenUpdating = False
Dim x As Integer
x = ThisWorkbook.Worksheets("ETFs").Cells(Rows.Count, 1).End(xlUp).Row
'Dim i As Integer
'For i = 2 To x
Dim total_net_assets As Variant, open_price As Variant, NAV As Variant, shares_out
Set ie = CreateObject("InternetExplorer.application")
With ie
.Visible = False
.navigate link
While .Busy Or .readyState < 4: DoEvents: Wend
Do
DoEvents
On Error Resume Next
' Here is where I get the problem of not knowing how to reference the values I need because the class name appears repeatedly
total_net_assets = .document.getElementsByClassName("").Value
open_price = .document.getElementByClassName("price").Value
NAV = .document.getElementByClassName("").Value
shares_out = .document.getElementByClassName("kv__value kv__primary ").Value
On Error GoTo 0
Loop
End With
ThisWorkbook.Worksheets("ETFs").Cells(2, 13).Value = total_net_assets
ThisWorkbook.Worksheets("ETFs").Cells(2, 14).Value = NAV
ThisWorkbook.Worksheets("ETFs").Cells(2, 15).Value = open_price
ThisWorkbook.Worksheets("ETFs").Cells(2, 16).Value = shares_out
ie.Quit
'Next i
Application.ScreenUpdating = True
End Sub
Access method:
I use XMLHTTP requests as they are much faster than opening IE.
Code notes:
The following reads in fund short codes from Sheet1 column A, starting in A2, into an array. You can easily extend this adding more funds into column A.
This array is looped issuing XMLHTTP requests by concatenating the fund code into the BASE_URL variable.
I use a class, clsHTTP, to hold the XMLHTTP object to be efficient - no need to keep creating and destroying the object.
I provide this class with two methods. One to retrieve the target page innerHTML (GetString), and the other to extract the required info if available (GetInfo). I use a dictionary to test if the searched for labels are present. If present I grab the associated value. If not, I have a placeholder vbNullString in the dictionary.
I add each scraped result into a collection called results. At the end I loop this writing out to the sheet. By keeping most of the work in memory this provides for much faster scraping.
Retrieving info from HTML:
The labels e.g. Open, and values come in pairs.
You can generate a nodeList (think of it as a collection, as with getElementsByClassName) by using the querySelectorAll method to apply a class CSS selector, gathering the label elements by their class name kv__label. The "." is the class selector.
Set labels = .querySelectorAll(".kv__label") '<== nodeList of labels
You do the same to get the associated values:
Set values = .querySelectorAll(".kv__value.kv__primary") '<== nodeList of associated values. Same length as labels nodeList so can use same index to retrieve associated label/value pairs from each nodeList.
Inside the clsHTTP method .GetInfo you loop the labels, using the dictionary to test whether each searched-for label is present. If it is, the associated value is retrieved from values using the same index at which the label was found in the labels nodeList, and the dictionary's vbNullString value for that label is updated with the retrieved value; otherwise it is left as vbNullString.
VBA:
Class module clsHTTP:
Option Explicit
Private http As Object
Private Sub Class_Initialize()
Set http = CreateObject("MSXML2.XMLHTTP")
End Sub
Public Function GetString(ByVal url As String) As String
Dim sResponse As String
With http
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
GetString = sResponse
End With
End Function
Public Function GetInfo(ByVal html As HTMLDocument) As Object
Dim dict As Object, i As Long
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "Open", vbNullString
dict.Add "Shares Outstanding", vbNullString
dict.Add "Total Net Assets", vbNullString
dict.Add "NAV", vbNullString
Dim values As Object, labels As Object
With html
Set values = .querySelectorAll(".kv__value.kv__primary")
Set labels = .querySelectorAll(".kv__label")
For i = 0 To labels.Length - 1
If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
Next
End With
Set GetInfo = dict
End Function
Standard module 1:
Option Explicit
Public Sub GetFundInfo()
Dim sResponse As String, html As HTMLDocument, http As clsHTTP, i As Long
Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
Application.ScreenUpdating = False
headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
Set results = New Collection
Set http = New clsHTTP
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
For i = LBound(funds) To UBound(funds)
If Not IsEmpty(funds(i)) Then
url = BASE_URL & funds(i)
html.body.innerHTML = http.GetString(url)
results.Add http.GetInfo(html).Items
End If
Next
If results.Count > 0 Then
Dim item As Variant, r As Long, c As Long
r = 2: c = 2
With ws
.Cells(1, c).Resize(1, UBound(headers) + 1) = headers
For Each item In results
.Cells(r, c).Resize(1, UBound(item) + 1) = item
r = r + 1
Next
End With
End If
Application.ScreenUpdating = True
End Sub
Without using a class:
Option Explicit
Public Sub GetFundInfo()
Dim sResponse As String, html As HTMLDocument, i As Long
Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
Application.ScreenUpdating = False
headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
Set results = New Collection
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
For i = LBound(funds) To UBound(funds)
If Not IsEmpty(funds(i)) Then
url = BASE_URL & funds(i)
html.body.innerHTML = GetString(url)
results.Add GetInfo(html).Items
End If
Next
If results.Count > 0 Then
Dim item As Variant, r As Long, c As Long
r = 2: c = 2
With ws
.Cells(1, c).Resize(1, UBound(headers) + 1) = headers
For Each item In results
.Cells(r, c).Resize(1, UBound(item) + 1) = item
r = r + 1
Next
End With
End If
Application.ScreenUpdating = True
End Sub
Public Function GetString(ByVal url As String) As String
Dim http As Object
Set http = CreateObject("MSXML2.XMLHTTP")
Dim sResponse As String
With http
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
GetString = sResponse
End With
End Function
Public Function GetInfo(ByVal html As HTMLDocument) As Object
Dim dict As Object, i As Long
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "Open", vbNullString
dict.Add "Shares Outstanding", vbNullString
dict.Add "Total Net Assets", vbNullString
dict.Add "NAV", vbNullString
Dim values As Object, labels As Object
With html
Set values = .querySelectorAll(".kv__value.kv__primary")
Set labels = .querySelectorAll(".kv__label")
For i = 0 To labels.Length - 1
If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
Next
End With
Set GetInfo = dict
End Function
Okay, so you will need to create two loops. You can just keep reusing the elem0, elem1, and elemColl(1) variables for each price point you need; just make sure to reset bFoundIt to False for each new iteration so you do not exit the For loops early.
For your total_net_assets variable, you will first loop over the elements with class kv__item. You will then loop over each collection of kv__label elements within the kv__item's children and stop when you match the innerText Total Net Assets. Once you have matched it, you use the outer element elem0 to get the kv__value kv__primary element for it.
Dim IE As Object, elem0 As Object, elem1 As Object, i As Long, bFoundIt As Boolean
Set IE = CreateObject("InternetExplorer.application")
With IE
.Visible = False
.navigate link
While .Busy Or .readyState < 4: DoEvents: Wend
DoEvents
bFoundIt = False
For Each elem0 In .document.getElementsByClassName("kv__item")
For Each elem1 In elem0.getElementsByClassName("kv__label")
If elem1.innerText = "Total Net Assets" Then
bFoundIt = True
total_net_assets = elem0.getElementsByClassName("kv__value kv__primary ")(0).innerText
Exit For
End If
Next elem1
If bFoundIt Then Exit For
Next elem0
End With
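The same nested-loop pattern can then be repeated for each remaining field. For example, a sketch for the open price, continuing straight on from the block above and reusing its IE, elem0, elem1 and bFoundIt variables (it assumes the label text is exactly "Open"):

Dim open_price As String
bFoundIt = False
For Each elem0 In IE.document.getElementsByClassName("kv__item")
    For Each elem1 In elem0.getElementsByClassName("kv__label")
        If elem1.innerText = "Open" Then
            bFoundIt = True
            open_price = elem0.getElementsByClassName("kv__value kv__primary ")(0).innerText
            Exit For
        End If
    Next elem1
    If bFoundIt Then Exit For
Next elem0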
