Web Scraping ETFs Daily Data VBA - excel

I'm trying to web scrape some daily info for different ETFs. I found that https://www.marketwatch.com/ has accurate info.
The most relevant info is the open price, outstanding shares, NAV and total assets of the ETF.
Here is the link for IVV US Equity: https://www.marketwatch.com/investing/fund/ivv
I have web scraped with VBA before, but the HTML of the pages I used was different. I don't know if this is because some values of the ETFs (such as price and traded volume) change constantly.
The idea is to write code that extracts the relevant info and builds a database to analyze macroeconomic factors, using the ETFs as market indicators of flows between countries, regions, etc.
My first approach would be with VBA, but once I get further into the data I would like to try Python (after I get more confident with it) to automate the web scraping process on a daily basis.
I am open to any suggestion or any other website that could be useful (I have tried Yahoo Finance and Morningstar and I get the same problem with the HTML code).
This is my poor code:
Sub Get_Data()
Dim ticker As String, enlace As String
ticker = ThisWorkbook.Worksheets("ETFs").Cells(2, 2).Value 'IVV
'link = "https://www.morningstar.com/etfs/arcx/" & ticker & "/quote.html"
'link = "https://finance.yahoo.com/quote/" & ticker & "?p=" & ticker
link = "https://www.marketwatch.com/investing/fund/" & ticker
Application.ScreenUpdating = False
Dim x As Integer
x = ThisWorkbook.Worksheets("ETFs").Cells(Rows.Count, 1).End(xlUp).Row
'Dim i As Integer
'For i = 2 To x
Dim total_net_assets As Variant, open_price As Variant, NAV As Variant, shares_out
Set ie = CreateObject("InternetExplorer.application")
With ie
.Visible = False
.navigate link
While .Busy Or .readyState < 4: DoEvents: Wend
Do
DoEvents
On Error Resume Next
' Here is where I get the problem of not knowing how to reference the values I need because the class name appears repeatedly
total_net_assets = .document.getElementsByClassName("").Value
open_price = .document.getElementByClassName("price").Value
NAV = .document.getElementByClassName("").Value
shares_out = .document.getElementByClassName("kv__value kv__primary ").Value
On Error GoTo 0
Loop
End With
ThisWorkbook.Worksheets("ETFs").Cells(2, 13).Value = total_net_assets
ThisWorkbook.Worksheets("ETFs").Cells(2, 14).Value = NAV
ThisWorkbook.Worksheets("ETFs").Cells(2, 15).Value = open_price
ThisWorkbook.Worksheets("ETFs").Cells(2, 16).Value = shares_out
ie.Quit
'Next i
Application.ScreenUpdating = True
End Sub

Access method:
I use XMLHTTP requests, as they are much faster than opening IE.
Code notes:
The following reads fund short codes from Sheet1 column A, starting at A2, into an array. You can easily extend this by adding more funds to column A.
The array is looped, issuing one XMLHTTP request per fund by appending the fund code to the BASE_URL constant.
I use a class, clsHTTP, to hold the XMLHTTP object for efficiency: there is no need to keep creating and destroying the object.
I provide this class with two methods: one to retrieve the target page HTML (GetString), and the other to extract the required info if available (GetInfo). I use a dictionary to test whether the searched-for labels are present. If a label is present I grab the associated value; if not, the dictionary keeps its vbNullString placeholder.
I add each scraped result to a collection called results. At the end I loop over this collection, writing out to the sheet. Keeping most of the work in memory makes for much faster scraping.
Retrieving info from HTML:
The labels e.g. Open, and values come in pairs.
You can generate a nodeList (think of a collection, as with getElementsByClassName) by using the querySelectorAll method to apply a class CSS selector that gathers the label elements by their class name, kv__label. The "." is the class selector.
Set labels = .querySelectorAll(".kv__label") '<== nodeList of labels
You do the same to get the associated values:
Set values = .querySelectorAll(".kv__value.kv__primary") '<== nodeList of associated values. Same length as labels nodeList so can use same index to retrieve associated label/value pairs from each nodeList.
The GetInfo method of clsHTTP loops over the labels and uses the dictionary to check whether each label is one you searched for. If it is, the associated value is retrieved from values using the same index at which the label was found in labels, and the dictionary's vbNullString placeholder for that label is replaced with the retrieved value; otherwise it is left as vbNullString.
VBA:
Class module clsHTTP:
Option Explicit
Private http As Object
Private Sub Class_Initialize()
Set http = CreateObject("MSXML2.XMLHTTP")
End Sub
Public Function GetString(ByVal url As String) As String
Dim sResponse As String
With http
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
GetString = sResponse
End With
End Function
Public Function GetInfo(ByVal html As HTMLDocument) As Object
Dim dict As Object, i As Long
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "Open", vbNullString
dict.Add "Shares Outstanding", vbNullString
dict.Add "Total Net Assets", vbNullString
dict.Add "NAV", vbNullString
Dim values As Object, labels As Object
With html
Set values = .querySelectorAll(".kv__value.kv__primary")
Set labels = .querySelectorAll(".kv__label")
For i = 0 To labels.Length - 1
If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
Next
End With
Set GetInfo = dict
End Function
Standard module 1:
Option Explicit
Public Sub GetFundInfo()
Dim sResponse As String, html As HTMLDocument, http As clsHTTP, i As Long
Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
Application.ScreenUpdating = False
headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
Set results = New Collection
Set http = New clsHTTP
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your fund codes.
For i = LBound(funds) To UBound(funds)
If Not IsEmpty(funds(i)) Then
url = BASE_URL & funds(i)
html.body.innerHTML = http.GetString(url)
results.Add http.GetInfo(html).Items
End If
Next
If results.Count > 0 Then
Dim item As Variant, r As Long, c As Long
r = 2: c = 2
With ws
.Cells(1, c).Resize(1, UBound(headers) + 1) = headers
For Each item In results
.Cells(r, c).Resize(1, UBound(item) + 1) = item
r = r + 1
Next
End With
End If
Application.ScreenUpdating = True
End Sub
Without using a class:
Option Explicit
Public Sub GetFundInfo()
Dim sResponse As String, html As HTMLDocument, i As Long
Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
Application.ScreenUpdating = False
headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
Set results = New Collection
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your fund codes.
For i = LBound(funds) To UBound(funds)
If Not IsEmpty(funds(i)) Then
url = BASE_URL & funds(i)
html.body.innerHTML = GetString(url)
results.Add GetInfo(html).Items
End If
Next
If results.Count > 0 Then
Dim item As Variant, r As Long, c As Long
r = 2: c = 2
With ws
.Cells(1, c).Resize(1, UBound(headers) + 1) = headers
For Each item In results
.Cells(r, c).Resize(1, UBound(item) + 1) = item
r = r + 1
Next
End With
End If
Application.ScreenUpdating = True
End Sub
Public Function GetString(ByVal url As String) As String
Dim http As Object
Set http = CreateObject("MSXML2.XMLHTTP")
Dim sResponse As String
With http
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
GetString = sResponse
End With
End Function
Public Function GetInfo(ByVal html As HTMLDocument) As Object
Dim dict As Object, i As Long
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "Open", vbNullString
dict.Add "Shares Outstanding", vbNullString
dict.Add "Total Net Assets", vbNullString
dict.Add "NAV", vbNullString
Dim values As Object, labels As Object
With html
Set values = .querySelectorAll(".kv__value.kv__primary")
Set labels = .querySelectorAll(".kv__label")
For i = 0 To labels.Length - 1
If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
Next
End With
Set GetInfo = dict
End Function

Okay, so you will need to create two loops. You can just keep reusing the elem0, elem1, and elemColl(1) variables for each price point you need - just make sure to reset bFoundIt to False for each new iteration so you do not exit the For Loops early.
For your total_net_assets variable, first loop over the elements with the class kv__item. Within each of those, loop over its kv__label elements and stop when the innerText matches Total Net Assets. Once you have a match, use the outer element elem0 to read the value from its kv__value kv__primary element.
Dim IE As Object, elem0 As Object, elem1 As Object, i As Long, bFoundIt As Boolean
Set IE = CreateObject("InternetExplorer.application")
With IE
.Visible = False
.navigate link
While .Busy Or .readyState < 4: DoEvents: Wend
DoEvents
bFoundIt = False
For Each elem0 In .document.getElementsByClassName("kv__item")
For Each elem1 In elem0.getElementsByClassName("kv__label")
If elem1.innerText = "Total Net Assets" Then
bFoundIt = True
total_net_assets = elem0.getElementsByClassName("kv__value kv__primary ")(0).innerText
Exit For
End If
Next elem1
If bFoundIt Then Exit For
Next elem0
End With

Related

Scrape economic data from investing website

I am working on code to get data from: https://www.investing.com/economic-calendar/core-durable-goods-orders-59
I already have code that gets data via an HTTP request, but I would like to change it to work for the economic data (link above). Is there any way to do the same for the economic indicators?
Code below:
Option Explicit
Sub Export_Table()
'Html Objects---------------------------------------'
Dim htmlDoc As MSHTML.HTMLDocument
Dim htmlBody As MSHTML.htmlBody
Dim ieTable As MSHTML.HTMLTable
Dim Element As MSHTML.HTMLElementCollection
'Workbooks, Worksheets, Ranges, LastRow, Incrementers ----------------'
Dim wb As Workbook
Dim Table As Worksheet
Dim i As Long
Set wb = ThisWorkbook
Set Table = wb.Worksheets("Sheet1")
'-------------------------------------------'
Dim xmlHttpRequest As New MSXML2.XMLHTTP60 '
'-------------------------------------------'
i = 2
'Web Request --------------------------------------------------------------------------'
With xmlHttpRequest
.Open "POST", "https://www.investing.com/instruments/HistoricalDataAjax", False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
.setRequestHeader "X-Requested-With", "XMLHttpRequest"
.send "curr_id=951681&smlID=1695217&header=CLNX+Historical+Data&st_date=01%2F01%2F2017&end_date=03%2F01%2F2019&interval_sec=Monthly&sort_col=date&sort_ord=DESC&action=historical_data"
If .Status = 200 Then
Set htmlDoc = CreateHTMLDoc
Set htmlBody = htmlDoc.body
htmlBody.innerHTML = xmlHttpRequest.responseText
Set ieTable = htmlDoc.getElementById("curr_table")
For Each Element In ieTable.getElementsByTagName("tr")
Table.Cells(i, 1) = Element.Children(0).innerText
Table.Cells(i, 2) = Element.Children(1).innerText
Table.Cells(i, 3) = Element.Children(2).innerText
Table.Cells(i, 4) = Element.Children(3).innerText
Table.Cells(i, 5) = Element.Children(4).innerText
Table.Cells(i, 6) = Element.Children(5).innerText
Table.Cells(i, 7) = Element.Children(6).innerText
i = i + 1
DoEvents: Next Element
End If
End With
Set xmlHttpRequest = Nothing
Set htmlDoc = Nothing
Set htmlBody = Nothing
Set ieTable = Nothing
Set Element = Nothing
End Sub
Public Function CreateHTMLDoc() As MSHTML.HTMLDocument
Set CreateHTMLDoc = CreateObject("htmlfile")
End Function
I have used the Excel tool Power Query for this very thing. It is also called Get & Transform Data. I don't like using VBA for this type of thing.
To make it work:
In Excel, go to Data > Get Data > From Other Sources > From Web.
Enter the URL.
Wait for the webpage to load and then pick the table that you want.
This website took a while to load, but it did work for me.
Choose "Load", which goes directly to the sheet, or "Transform Data" to manipulate the data in Power Query. There are many options in Power Query, such as splitting columns, filtering data and adding calculated columns.
I would avoid the overhead of setting up a permanent connection and simply continue using XHR. With Data > From Web, you cannot grab more rows than are present on the initial landing page. With XHR, however, you can issue POST requests to get more data. The code below uses a loop to retrieve additional results beyond those immediately visible on the page.
When you press the Show more link there is a POST request for an additional 6 rows which uses the latest date from the current set of results as part of the POST body. The response returned is JSON. Rather than bringing in a JSON parser, given the standard nature of the JSON, and that I am already using regex to clean the date format in column 1 to put in the POST body, I use two simple regexes to extract the html for the next results table from the response, and to check whether there are more results.
The format of the JSON is:
{
"historyRows": "<tr>…..</tr>",
"hasMoreHistory": "1"
}
Or
{
"historyRows": "<tr>…..</tr>",
"hasMoreHistory": false
}
So, I do some cleaning of the extracted html in order to not confuse the html parser within MSHTML. Furthermore, I add in an id to identify the table I have constructed, so I can continue to use an id css selector (#) list within my UpdateDateResults function.
I initially oversize an array to store each retrieved table which I update ByRef. I loop requesting more results until either there are no more results, there is an error parsing the maximum date from the last retrieved table column 1, or until my specified earliest date for data retrieval falls within the date range of the latest returned table.
Finally, I write the results array out to the sheet in one go.
N.B. You can target the table by its id. It looks like the number at the end of the id could be the same as for the goods url, lending itself to generalizing the code below to work for other goods.
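The N.B. above can be sketched as a pair of small string helpers. This only illustrates the guess that the trailing number in the URL matches the number in the table id; EventIdFromUrl and HistoryTableSelector are hypothetical names that are not part of the code below, and the pattern has not been checked against other indicators.

```vba
Option Explicit

' Hypothetical helper: returns the digits after the last "-" in the URL,
' or vbNullString when the URL does not end in "-<digits>".
Public Function EventIdFromUrl(ByVal url As String) As String
    Dim pos As Long, tail As String
    pos = InStrRev(url, "-")
    If pos = 0 Then Exit Function
    tail = Mid$(url, pos + 1)
    If Len(tail) = 0 Then Exit Function
    ' Like "*[!0-9]*" is True when any non-digit character is present
    If Not tail Like "*[!0-9]*" Then EventIdFromUrl = tail
End Function

' Hypothetical helper: builds the CSS selector list for the history table.
' "#me" is the id given to the reconstructed table in the answer's code.
Public Function HistoryTableSelector(ByVal url As String) As String
    Dim eventId As String
    eventId = EventIdFromUrl(url)
    If Len(eventId) > 0 Then HistoryTableSelector = "#eventHistoryTable" & eventId & ", #me"
End Function
```

If the assumption holds, the result of HistoryTableSelector could be passed to querySelector in place of the hard-coded selector string.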
VBA:
Option Explicit
Public Sub GetInvestingInfo()
'tools > references > Microsoft HTML Object Library
Dim html As MSHTML.HTMLDocument, xhr As Object
Const STARTDATE As Date = "2019-11-25" 'Adjust as required. DateAdd("yyyy", -2, Date) 2 years back. This means may have some earlier months in _
batch that spans the start date but won't issue an additional request after this
Set xhr = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", "https://www.investing.com/economic-calendar/core-durable-goods-orders-59", False
.setRequestHeader "User-Agent", "Safari/537.36"
.send
html.body.innerHTML = .responseText
End With
Dim firstTable As Boolean, r As Long, results() As Variant
ReDim results(1 To 100000, 1 To 5)
'process initial table and update results, get cleaned date needed for request for more results
firstTable = True
Dim latestDate As String
UpdateDateResults latestDate, results, firstTable, r, html
Dim re As Object, maxDate As String, hasMoreHistory As Boolean, s As String
Set re = CreateObject("VBScript.RegExp")
With re
.Global = True
.MultiLine = False
End With
maxDate = cleanedDate(latestDate, re)
hasMoreHistory = True
Dim errorDate As Date
errorDate = DateAdd("d", 1, Date)
Do While maxDate >= STARTDATE And maxDate < errorDate 'break loop using pre-defined earliest date, error with date conversion, or when no more rows found
Application.Wait (Now + TimeSerial(0, 0, 1)) 'Pause
s = GetMoreRows(xhr, Format$(maxDate, "YYYY-MM-DD")) 'make a POST request for more data
re.Pattern = "hasMoreHistory"":(""?.*?""?)}" 'Check if there are more rows still available. "1" for yes, false for no
hasMoreHistory = (re.Execute(s)(0).submatches(0) <> False)
If Not hasMoreHistory Then Exit Do
re.Pattern = "historyRows"":""(.*)"","
html.body.innerHTML = "<table id=""me"">" & Replace$(re.Execute(s)(0).submatches(0), "\/", "/") & "</table>" 'fix html and feed into html variable
UpdateDateResults latestDate, results, firstTable, r, html
maxDate = cleanedDate(latestDate, re) 'convert value retrieved from last row in date column of table to an actual date
Loop
With ActiveSheet
.Cells.ClearContents
.Cells(1, 1).Resize(r, 5) = results 'Don't bother to resize results as clear all cells before write out
End With
End Sub
Public Sub UpdateDateResults(ByRef latestDate As String, ByRef results() As Variant, ByRef firstTable As Boolean, ByRef r As Long, ByVal html As MSHTML.HTMLDocument)
Dim table As MSHTML.HTMLTable 'return latest date from function
Set table = html.querySelector("#eventHistoryTable59, #me")
latestDate = table.Rows(table.Rows.Length - 1).Children(0).innerText
Dim i As Long, n As Long, j As Long
n = IIf(firstTable, 0, 1)
For i = n To table.Rows.Length - 1
r = r + 1
For j = 0 To table.Rows(i).Children.Length - 2
results(r, j + 1) = table.Rows(i).Children(j).innerText
Next
Next
firstTable = False
End Sub
Public Function cleanedDate(ByVal dirtyString As String, ByVal re As Object) As Date
re.Pattern = "(^[A-Z][a-z]{2}).*(\d{2}),.(\d{4})(.*)"
On Error GoTo errhand:
If re.test(dirtyString) Then
cleanedDate = CDate(re.Replace(dirtyString, "$2" & Chr$(32) & "$1" & Chr$(32) & "$3"))
Exit Function
End If
errhand:
cleanedDate = DateAdd("d", 1, Date)
End Function
Public Function GetMoreRows(ByVal xhr As Object, ByVal dateStamp As String) As String
With xhr
.Open "POST", "https://www.investing.com/economic-calendar/more-history", False
.setRequestHeader "User-Agent", "Safari/537.36"
.setRequestHeader "x-requested-with", "XMLHttpRequest"
.setRequestHeader "content-type", "application/x-www-form-urlencoded"
.send "eventID=430865&event_attr_ID=59&event_timestamp=" & dateStamp & "+" & Application.WorksheetFunction.EncodeURL("12:30:00") & "&is_speech=0"
GetMoreRows = .responseText
End With
End Function
Regexes (without the double " escaping for VBA):
hasMoreHistory":("?.*?"?)}
historyRows":"(.*)",
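As a minimal sketch of how these two patterns behave, the function below applies them to a compact JSON string of the shape described earlier. ParseMoreHistory is a hypothetical name, the sample strings in the usage note are made up, and the "1" / false comparison is done on text rather than the way the main routine does it.

```vba
Option Explicit

' Hypothetical helper: extracts the row html into rowsHtml (ByRef) and
' returns True when the response says more history is available ("1").
Public Function ParseMoreHistory(ByVal json As String, ByRef rowsHtml As String) As Boolean
    Dim re As Object, matches As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = False
    re.MultiLine = False

    rowsHtml = vbNullString
    re.Pattern = "historyRows"":""(.*)"","
    Set matches = re.Execute(json)
    If matches.Count > 0 Then
        ' Undo the JSON escaping of "/" so the html parser sees normal closing tags
        rowsHtml = Replace$(matches(0).SubMatches(0), "\/", "/")
    End If

    re.Pattern = "hasMoreHistory"":(""?.*?""?)}"
    Set matches = re.Execute(json)
    If matches.Count > 0 Then
        ' The capture is either "1" (quotes included) or false; strip quotes before comparing
        ParseMoreHistory = (Replace$(matches(0).SubMatches(0), """", vbNullString) = "1")
    End If
End Function
```

For example, with the made-up input {"historyRows":"<tr><td>a<\/td><\/tr>","hasMoreHistory":"1"} the function returns True and sets rowsHtml to <tr><td>a</td></tr>.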

Web Scraping - VBA

I am trying to scrape data from a website without any luck. I manage to navigate through the elements, but I haven't managed to get the information from the last elements. Below is my code; any help would be appreciated.
Option Explicit
Sub Download_Historical_Data()
Dim IE As InternetExplorer, doc As HTMLDocument
Dim All_Matches, Match
Dim All_Champions, Champion
'Open Browser and download data
Set IE = New InternetExplorer
With IE
.Visible = True
.Navigate ("https://www.scorespro.com/soccer/results/")
While .Busy Or .readyState < 4: DoEvents: Wend
Set doc = .document
End With
Set All_Champions = doc.getElementById("matches-data").getElementsByClassName("compgrp")
For Each Champion In All_Champions
Set All_Matches = Champion.getElementsByTagName("table")
For Each Match In All_Matches
If Left(Match.className, 12) = "blocks gteam" Then
With Match
'All the info
End With
End If
Next Match
Next Champion
IE.Quit
Set IE = Nothing
End Sub
Sample on 9/8/19:
Sample on 7/8/19:
Output:
The reason I have used 2 different days as samples is that one of them has a game with penalties, and I want to include this as well.
You don't need to automate a browser. If you inspect the network traffic when selecting a date you will see an XHR request for the info. You can use those details (in fact I shorten to just the required url params) to retrieve the page content.
The info is contained in table tag elements. The champion is in tables with class name blockBar, otherwise the info is for the row info as seen on page. In order to leverage querySelector (which is a method of HTMLDocument) to select the sub table level elements, by class name, for individual tables, I stick the individual table html into a surrogate html document variable; I then have access to querySelector again and so can write nice flexible/descriptive css selectors to match on elements.
The columns in your output all have nice descriptive class names in the XHR response, so you can use those to determine which column to write to. As score info may risk losing formatting on output I use a Select Case statement, to test for those css selectors, and append a single quote to preserve formatting on output.
I choose, for efficiency, to store all results in an array and write out in one go.
Option Explicit
Public Sub GetMatchInfo()
Dim headers(), results(), r As Long, c As Long, ws As Worksheet, i As Long
Dim champion As String, html As HTMLDocument, html2 As HTMLDocument, cssSelectors(), j As Long
Set html = New HTMLDocument
Set html2 = New HTMLDocument
Set ws = ThisWorkbook.Worksheets("Sheet1")
headers = Array("Date", "Time", "Status", "Champion", "Home Team", "Full Time Score", "Away Team", "Half Time", "Penalties Score")
cssSelectors = Array(".kick_t_dt", ".kick_t_ko", ".status", "champion", ".home", ".score_link", ".away", ".halftime", ".after_pen")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.scorespro.com/soccer/ajax-calendar.php?mode=results&date=2019-08-07", False
.send
html.body.innerHTML = .responseText
End With
Dim tables As Object, selector As String
Set tables = html.querySelectorAll("table")
ReDim results(1 To tables.Length, 1 To UBound(headers) + 1)
For i = 0 To tables.Length - 1
If tables.item(i).className = "blockBar" Then
champion = tables.item(i).innerText
Else
r = r + 1
html2.body.innerHTML = tables.item(i).outerHTML
On Error Resume Next
For j = LBound(cssSelectors) To UBound(cssSelectors)
selector = cssSelectors(j)
Select Case selector
Case ".score_link", ".halftime", ".after_pen"
results(r, j + 1) = "'" & html2.querySelector(cssSelectors(j)).innerText
Case "champion"
results(r, j + 1) = champion
Case Else
results(r, j + 1) = html2.querySelector(cssSelectors(j)).innerText
End Select
Next
On Error GoTo 0
End If
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
Using IE
Option Explicit
Public Sub GetMatchInfo()
Dim headers(), results(), r As Long, c As Long, ws As Worksheet, i As Long
Dim champion As String, html As HTMLDocument, html2 As HTMLDocument, cssSelectors(), j As Long
Set html = New HTMLDocument
Set html2 = New HTMLDocument
Set ws = ThisWorkbook.Worksheets("Sheet1")
headers = Array("Date", "Time", "Status", "Champion", "Home Team", "Full Time Score", "Away Team", "Half Time", "Penalties Score")
cssSelectors = Array(".kick_t_dt", ".kick_t_ko", ".status", "champion", ".home", ".score_link", ".away", ".halftime", ".after_pen")
With CreateObject("InternetExplorer.Application")
.Navigate2 "https://www.scorespro.com/soccer/results/"
While .Busy Or .readyState <> 4: DoEvents: Wend
Application.Wait Now + TimeSerial(0, 0, 2)
html.body.innerHTML = .document.body.innerHTML
.Quit
End With
Dim tables As Object, selector As String
Set tables = html.querySelectorAll("table")
ReDim results(1 To tables.Length, 1 To UBound(headers) + 1)
For i = 0 To tables.Length - 1
If tables.item(i).className = "blockBar" Then
champion = tables.item(i).innerText
Else
r = r + 1
html2.body.innerHTML = tables.item(i).outerHTML
On Error Resume Next
For j = LBound(cssSelectors) To UBound(cssSelectors)
selector = cssSelectors(j)
Select Case selector
Case ".score_link", ".halftime", ".after_pen"
results(r, j + 1) = "'" & html2.querySelector(cssSelectors(j)).innerText
Case "champion"
results(r, j + 1) = champion
Case Else
results(r, j + 1) = html2.querySelector(cssSelectors(j)).innerText
End Select
Next
On Error GoTo 0
End If
Next
ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
I wrote this in WSL (web scraping language), but basically you can edit the JSON to add any other fields (assuming all the football games). Once you have all the data, you can have it sent either to your email or to your web server.
GOTO www.scorespro.com/soccer/results/ >>
EXTRACT {'time': '.kick_t', 'status':'.status',
'home':'.home.uc', 'score':'.score', 'away':'.away', 'match':'a'} IN table tr
Explanation: it goes to that score page, and then pulls time, status, home, score, away fields for each table row via table tr and finally the match field which will come from the header bar table row. It will look like {'time':undefined, ...., 'match':'Armenia: Premier League'} along with other table row game schedules like {'time':'2019/8/21' ,..., 'match':undefined}. Just merge the JSON objects afterwards.

VBA - web scraping can not find correct GET request

My question is related to another question, VBA - web scraping can not get HTMLElement innerText. I have a similar problem.
Website URL - https://www.hnb.hr/en/core-functions/monetary-policy/exchange-rate-list/exchange-rate-list
I need to get the date of the currency reference and the selected values. The problem is that I cannot find the correct GET request where these values are finally generated. I've found that it is related to this POST request:
POST /en/core-functions/monetary-policy/exchange-rate-list/exchange-rate-list?p_p_id=tecajnalistacontroller_WAR_hnbtecajnalistaportlet&p_p_lifecycle=2&p_p_state=normal&p_p_mode=view&p_p_resource_id=getTecajnaAjaxDataURL&p_p_cacheability=cacheLevelPage&p_p_col_id=column-2&p_p_col_count=2 HTTP/1.1
I would like to use a technique that gets elements by id, class or tag, whatever works, but again, the GET request to the provided URL returns too quickly to retrieve the required info.
XMLHTTP request and API:
I would use their API as shown below. I have some helper functions to aid with parsing the response. In GetDict function you can set the currencies you are interested in. In function GetRate you can specify the rate you are interested in. If you don't specify, it defaults to "median_rate".
Calling the API:
To get the rates for a particular date, make a[n] HTTP call to the following URL:
http://hnbex.eu/api/v1/rates/daily/?date=YYYY-MM-DD
The date parameter is optional. If not set, the current date (today) is used.
You can parse the JSON response with a JSON parser but I found it simpler to go with using Split to grab the required info from the JSON string. If you are familiar with JSON I will happily update with a JSON parsing example.
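The Split idea mentioned above can be shown in isolation with a small helper. This is a sketch only: TextBetween is a hypothetical name, and the sample string in the usage note is made up rather than copied from a real API response.

```vba
Option Explicit

' Hypothetical helper: returns the text between the first occurrence of
' startMarker and the next occurrence of endMarker, or vbNullString if
' startMarker is not found.
Public Function TextBetween(ByVal source As String, ByVal startMarker As String, ByVal endMarker As String) As String
    Dim parts() As String
    ' Limit of 2: parts(0) is everything before the marker, parts(1) everything after it
    parts = Split(source, startMarker, 2)
    If UBound(parts) < 1 Then Exit Function
    If Len(parts(1)) = 0 Then Exit Function
    TextBetween = Split(parts(1), endMarker)(0)
End Function
```

For instance, given the made-up string {"currency_code": "EUR", "median_rate": "7.391234"}, calling TextBetween with the start marker "median_rate": " (including the opening quote of the value) and a double quote as the end marker returns 7.391234. The approach is brittle if the spacing or key order of the JSON changes, which is where a real JSON parser earns its keep.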
Option Explicit
Public Sub GetInfo()
'http://hnbex.eu/api/v1/
Dim strJSON As String, http As Object, json As Object
Const URL As String = "http://hnbex.eu/api/v1/rates/daily/"
Set http = CreateObject("MSXML2.XMLHTTP")
With http
.Open "GET", URL, False
.send
strJSON = .responseText
End With
'Set json = JsonConverter.ParseJson(strJSON) '<== You could parse the JSON using a JSON parser such as JsonConverter
Dim currencyDict As Object
Set currencyDict = GetDict
Dim key As Variant, dictKeys As Variant, result As Variant
For Each key In currencyDict.keys
result = GetRate(strJSON, key)
If Not IsError(result) Then currencyDict(key) = result
result = vbNullString
Next key
PrintDictionary currencyDict
End Sub
Public Function GetDict() As Object '<== You could adapt to pass currencies as string arguments to the function. Or even a string array.
Dim dict As Object
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "EUR", vbNullString
dict.Add "CZK", vbNullString
dict.Add "HRK", vbNullString
dict.Add "HUF", vbNullString
dict.Add "PLN", vbNullString
dict.Add "RON", vbNullString
dict.Add "RSD", vbNullString
Set GetDict = dict
End Function
Public Function GetRate(ByVal json As String, ByVal key As Variant, Optional ByVal rate As String = "median_rate") As Variant
Dim arr() As String, tempString As String
On Error GoTo Errhand
arr = Split(json, """currency_code"": " & Chr$(34) & key & Chr$(34))
tempString = arr(1)
tempString = Split(arr(1), Chr$(34) & rate & Chr$(34) & ":")(1)
tempString = Split(tempString, ",")(0)
GetRate = tempString
Exit Function
Errhand:
GetRate = CVErr(xlErrNA)
End Function
Public Sub PrintDictionary(ByVal dict As Object)
Dim key As Variant
For Each key In dict.keys
Debug.Print key & " : " & dict(key)
Next
End Sub
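The routine above always requests today's rates. As a small sketch of the optional date parameter quoted from the API description, the helper below builds the URL for a given day; DailyRatesUrl is a hypothetical name and is not part of the code above.

```vba
Option Explicit

' Hypothetical helper: returns the daily-rates URL, adding ?date=YYYY-MM-DD
' only when a date is supplied (the API description says it defaults to today).
Public Function DailyRatesUrl(Optional ByVal rateDate As Variant) As String
    Const BASE_URL As String = "http://hnbex.eu/api/v1/rates/daily/"
    If IsMissing(rateDate) Then
        DailyRatesUrl = BASE_URL
    Else
        DailyRatesUrl = BASE_URL & "?date=" & Format$(CDate(rateDate), "yyyy-mm-dd")
    End If
End Function
```

The result could be passed to .Open "GET" in place of the URL constant, for example DailyRatesUrl(DateSerial(2019, 8, 7)).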
Internet Explorer:
You can use a loop with an explicit wait for the element to be present on the page (or populated).
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, t As Date, hTable As HTMLTable, clipboard As Object
Const WAIT_TIME_SECS As Long = 5
t = Timer
With IE
.Visible = True
.navigate "https://www.hnb.hr/en/core-functions/monetary-policy/exchange-rate-list/exchange-rate-list"
While .Busy Or .readyState < 4: DoEvents: Wend
Do
DoEvents
On Error Resume Next
Set hTable = .document.getElementById("records_table")
On Error GoTo 0
If Timer - t > WAIT_TIME_SECS Then Exit Do
Loop While hTable Is Nothing
If hTable Is Nothing Then
.Quit
Exit Sub
End If
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).PasteSpecial
.Quit '<== Remember to quit application
End With
End Sub

Inefficient UDF with Internet Explorer

The UDF below opens IE and returns the currency conversion rate from USD to the input (another currency ticker, e.g. EUR, GBP, HKD, etc.). For instance, if the input was ConvertUSD(USD), the output would be 1 since 1 USD = 1 USD.
Using the function once is fine; the issue I'm having is related to the way I intend to use it. I need to build a table with currency tickers spanning Col A (known values, stored as text). Col B will then show the conversion rate for the corresponding row. I intend to set B2 = ConvertUSD(A2) and then drag this down to the bottom row (roughly 48 currencies, so the ending row is B49). When I do this, 48 IE windows will be opened and closed, which is not ideal, but I am unsure how to avoid this.
How can I create this table with just one instance of IE being opened?
Public Function ConvertUSD(ConvertWhat As String) As Double
'References
' Microsoft XML, vs.0
' Microsoft Internet Controls
' Microsoft HTML Object Library.
Dim IE As New InternetExplorer
'IE.Visible = True
IE.Navigate "https://www.oanda.com/currency/converter?quote_currency=USD&base_currency=" & ConvertWhat
Do
DoEvents
Loop Until IE.ReadyState = ReadyState_Complete
Dim Doc As HTMLDocument
Set Doc = IE.Document
Dim Ans As String
Ans = Trim(Doc.getElementsByTagName("tbody")(2).innerText)
Dim AnsExtract As Variant
AnsExtract = Split(Ans, " ")
ConvertUSD = AnsExtract(4)
IE.Quit
End Function
I think a more efficient method would be to use one of the sites that provides api access to this kind of data. There are a number of both free and paid sites available. The routine below (which makes use of a free api) will download and write to a worksheet 170 foreign currencies in a fraction of a second and does not open ANY IE windows. For this download, I have specified USD as the base currency, but you can specify any base.
The output from the website is as a JSON, so a JSON parser will be of value. I used the free one available at:
VBA-JSON v2.2.3
(c) Tim Hall - https://github.com/VBA-tools/VBA-JSON
but there are others that run in VBA. Or you can write your own.
This also requires a reference to be set to Microsoft winHTTP Services, Version 5.1 (or you could use late binding)
Option Explicit
Sub latestForex()
Const app_id As String = "your_own_api_key"
Const sURL1 As String = "https://openexchangerates.org/api/latest.json"
Const sURL2 As String = "?app_id="
Const sURL3 As String = "&base=USD"
Dim sURL As String
Dim vRes As Variant, wsRes As Worksheet, rRes As Range
Dim v, w, i As Long
Dim httpRequest As WinHttpRequest
Dim strJSON As String, JSON As Object
sURL = sURL1 & sURL2 & app_id & sURL3
Set httpRequest = New WinHttpRequest
With httpRequest
.Open "GET", sURL 'HTTP verbs should be uppercase; some servers ignore lowercase verbs
.send
.WaitForResponse
strJSON = .responseText
End With
Set httpRequest = Nothing
Set JSON = ParseJson(strJSON)
i = 0
ReDim vRes(0 To JSON("rates").Count, 1 To 2)
Set wsRes = Worksheets("sheet3")
Set rRes = wsRes.Cells(1, 1)
vRes(0, 1) = (JSON("timestamp") / 86400) + #1/1/1970# 'UTC time
vRes(0, 2) = JSON("base")
For Each v In JSON("rates")
i = i + 1
vRes(i, 1) = v
vRes(i, 2) = JSON("rates")(v)
Next v
Set rRes = rRes.Resize(UBound(vRes, 1) + 1, UBound(vRes, 2))
With rRes
.EntireColumn.Clear
.Value2 = vRes
.Cells(1, 1).NumberFormat = "dd-mmm-yyyy hh:mm"
.Columns(2).NumberFormat = "$0.0000"
.EntireColumn.AutoFit
End With
End Sub
Here is a portion of the results.
Note that the time stamp is UTC. Obviously you can change that to local time.
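As a rough sketch of the local-time conversion mentioned above: VBA has no built-in time zone lookup, so the example below assumes the caller supplies a fixed UTC offset in hours. The function name UnixToLocal and its utcOffsetHours parameter are hypothetical and not part of the original routine, and daylight saving time is not handled.

```vba
Option Explicit

' Convert a Unix timestamp (seconds since 1 Jan 1970 UTC) to a VBA Date,
' shifted by a fixed offset in hours.
' ASSUMPTION: the offset is supplied by the caller; no time zone database
' or daylight saving adjustment is involved.
Public Function UnixToLocal(ByVal unixSeconds As Double, _
                            ByVal utcOffsetHours As Double) As Date
    Dim utc As Date
    ' 86400 seconds per day; a VBA Date stores days as a Double
    utc = DateSerial(1970, 1, 1) + unixSeconds / 86400#
    UnixToLocal = utc + utcOffsetHours / 24#
End Function
```

With this in place, the timestamp line in the routine could read, for example, vRes(0, 1) = UnixToLocal(JSON("timestamp"), -5), where -5 is only an illustrative offset.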
Don't use a UDF. Just use a sub/macro to refresh the whole list on demand.
Do it like this:
Sub RefreshCurrencyRates()
' Run this sub as a macro. Use a keyboard shortcut or a button to invoke it.
' You can even add a call to the sub in the Workbook_Open event if you like.
' This sub assumes that the relevant sheet is the active sheet. This will always be the case if you use a
' button placed on the sheet itself. Otherwise, you might want to add further code to specify the sheet.
'
' Best practice:
With Application
.ScreenUpdating = False
.Calculation = xlCalculationManual
.DisplayAlerts = False
.EnableEvents = False
End With
'
' The first thing you need to do is specify the range of rows which contain your currency codes.
' I'm hard-coding this here, but you can change it.
' As a first example, let's assume that you have the following currencies in cells A1-A4:
' A1 = GBP
' A2 = EUR
' A3 = HKD
' A4 = JPY
'
' So with rows 1-4, we'll do the following:
Dim RowNum As Long, CurCode As String
' Set up our Internet Explorer:
Dim IE As InternetExplorer
Set IE = New InternetExplorer
'
For RowNum = 1 To 4
CurCode = Cells(RowNum, 1).Value ' Takes the currency code from column A in each row
Cells(RowNum, 2).Value = ConvertUSD(CurCode, IE) ' Gets the relevant conversion and enters it into column B
Next RowNum
' Cleardown
IE.Quit
Set IE = Nothing
With Application
.ScreenUpdating = True
.Calculation = xlCalculationAutomatic
.DisplayAlerts = True
.EnableEvents = True
End With
End Sub
Public Function ConvertUSD(ByVal ConvertWhat As String, IE As InternetExplorer) As Double
'References
' Microsoft XML, vs.0
' Microsoft Internet Controls
' Microsoft HTML Object Library.
IE.Navigate "https://www.oanda.com/currency/converter?quote_currency=USD&base_currency=" & ConvertWhat
' Wait until the reused IE instance has finished loading the new page
Do
DoEvents
Loop Until Not IE.Busy And IE.ReadyState = READYSTATE_COMPLETE
Dim Doc As HTMLDocument
Set Doc = IE.Document
Dim Ans As String
Ans = Trim(Doc.getElementsByTagName("tbody")(2).innerText)
Dim AnsExtract As Variant
AnsExtract = Split(Ans, " ")
ConvertUSD = AnsExtract(4)
End Function
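The sub above hard-codes rows 1 to 4. One possible way to cover a list of any length, such as the A2:A49 range described in the question, is to look up the last used row at run time. This is only a sketch; the helper name LastUsedRow is hypothetical.

```vba
Option Explicit

' Return the row number of the last non-empty cell in a column.
' Starts at the bottom of the sheet and jumps upwards, which is the same
' as pressing Ctrl+Up from the last row.
' Note: for a completely empty column this returns 1.
Public Function LastUsedRow(ByVal ws As Worksheet, ByVal colNum As Long) As Long
    LastUsedRow = ws.Cells(ws.Rows.Count, colNum).End(xlUp).Row
End Function
```

The loop header in RefreshCurrencyRates could then become For RowNum = 2 To LastUsedRow(ActiveSheet, 1), assuming a header in row 1 and the currency codes in column A.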

web scraping using excel and VBA

I wrote the VBA code below in an Excel workbook, but it does not scrape the data and I don't know why. The only result it gives me is "click here to read more". I want to scrape all of the data, such as first name, last name, state, zip code and so on. Can anyone help?
Sub extractTablesData()
Dim IE As Object, obj As Object
Dim myState As String
Dim r As Integer, c As Integer, t As Integer
Dim elemCollection As Object
Set IE = CreateObject("InternetExplorer.Application")
myState = InputBox("Enter the city where you wish to work")
With IE
.Visible = True
.navigate ("http://www.funeralhomes.com/go/listing/Search? name=&city=&state=&country=USA&zip=&radius=")
While IE.readyState <> 4
DoEvents
Wend
For Each obj In IE.document.all.item("state").Options
If obj.innerText = myState Then
obj.Selected = True
End If
Next obj
IE.document.getElementsByValue("Search").item.Click
Do While IE.Busy: DoEvents: Loop
ThisWorkbook.Sheets("Sheet1").Range("A1:K1500").ClearContents
Set elemCollection = IE.document.getElementsByTagName("TABLE")
For t = 0 To (elemCollection.Length - 1)
For r = 0 To (elemCollection(t).Rows.Length - 1)
For c = 0 To (elemCollection(t).Rows(r).Cells.Length - 1)
ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = elemCollection(t).Rows(r).Cells(c).innerText
Next c
Next r
Next t
End With
Set IE = Nothing
End Sub
Using the same URL as the answer already given, you could alternatively use CSS selectors to get the elements of interest, and use Split to extract just the name and address parts from the text. We can also do away with the browser altogether to get faster results from the first results page.
Business name:
You can get the name with the following selector (using paid listing example):
div.paid-listing .listing-title
This selects (sample view)
Address info:
The associated descriptive information can be retrieved with the selector:
div.paid-listing .address-summary
And then using split we can parse this into just the address information.
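As a small illustration of the split step on its own, the sketch below keeps only the first line of a multi-line block of text. The helper name FirstLine is hypothetical, and the sample string in the usage note is made up rather than taken from the site.

```vba
Option Explicit

' Return the first line of a block of text.
' Line endings are normalised to LF first so that CRLF, CR and LF
' are all treated the same way.
Public Function FirstLine(ByVal text As String) As String
    Dim normalized As String
    normalized = Replace(Replace(text, vbCrLf, vbLf), vbCr, vbLf)
    ' Split on an empty string returns an empty array, so guard against it
    If Len(normalized) = 0 Then Exit Function
    FirstLine = Trim$(Split(normalized, vbLf)(0))
End Function
```

For example, FirstLine("123 Main St" & vbLf & "Click here to read more") returns "123 Main St".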
Code:
Option Explicit
Public Sub GetTitleAndAddress()
Dim oHtml As HTMLDocument, nodeList1 As Object, nodeList2 As Object, i As Long
Const URL As String = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"
Set oHtml = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", URL, False
.send
oHtml.body.innerHTML = .responseText
End With
Set nodeList1 = oHtml.querySelectorAll("div.paid-listing .listing-title")
Set nodeList2 = oHtml.querySelectorAll("div.paid-listing .address-summary")
With Worksheets("Sheet3")
.UsedRange.ClearContents
For i = 0 To nodeList1.Length - 1
.Range("A" & i + 1) = nodeList1.Item(i).innerText
.Range("B" & i + 1) = Split(nodeList2.Item(i).innerText, Chr$(10))(0)
Next i
End With
End Sub
Example output:
Yeah, without an API, this can be very tricky at best, and very inconsistent at worst. For now, you can try the script below.
Sub DumpData()
Dim IE As Object, itm As Object
Dim URL As String, RowCount As Long
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
URL = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"
IE.Navigate2 URL
'Wait for site to fully load (readyState 4 = complete)
Do While IE.Busy Or IE.readyState <> 4
DoEvents
Loop
RowCount = 1
With Sheets("Sheet1")
.Cells.ClearContents
'Dump the class name and the first 1024 characters of text for every listing element
For Each itm In IE.document.all
If itm.className Like "*free-listing*" Or itm.className Like "*paid-listing*" Then
.Range("A" & RowCount) = itm.className
.Range("B" & RowCount) = Left(itm.innerText, 1024)
RowCount = RowCount + 1
End If
Next itm
End With
'Close the browser when done
IE.Quit
Set IE = Nothing
End Sub
You will probably want some kind of input box to capture the city, state and radius from the user, or read those variables from cells in your worksheet.
Note that '%20' is the URL-encoded form of a space character.
I got this idea from a friend of mine, Joel, a long time ago. That guy is great!
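The input-box suggestion above might be sketched as follows. The URL layout (USA, then state, then city) is an assumption inferred from the single example URL in the script, and the function names BuildListingUrl and EncodeSpaces are hypothetical. Only spaces are encoded here; other special characters would need fuller URL encoding.

```vba
Option Explicit

' Build a listing URL from a state and a city name.
' ASSUMPTION: the path layout .../USA/<state>/<city> is inferred from the
' one example URL used in DumpData and is not confirmed by the site.
Public Function BuildListingUrl(ByVal stateName As String, _
                                ByVal cityName As String) As String
    Const BASE_URL As String = "http://www.funeralhomes.com/go/listing/ShowListing/USA/"
    BuildListingUrl = BASE_URL & EncodeSpaces(stateName) & "/" & EncodeSpaces(cityName)
End Function

' Replace spaces with %20 after trimming leading and trailing whitespace.
Private Function EncodeSpaces(ByVal s As String) As String
    EncodeSpaces = Replace(Trim$(s), " ", "%20")
End Function
```

In DumpData, the hard-coded assignment could then be replaced with something like URL = BuildListingUrl(InputBox("State"), InputBox("City")).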

Resources