VBA - web scraping can not get HTMLElement innerText

VBA - web scraping can not get HTMLElement innerText - excel

I'm trying to scrap the exchange rates using excel VBA but I can not get the innerText value I need. I don't understand why because the same technique works on the other sites.
URL - https://www.nbs.rs/export/sites/default/internet/english/scripts/kl_srednji.html
Sub GetCurr()
Dim tempHTMLDoc As New MSHTML.HTMLDocument
Dim HTMLCurrency As MSHTML.IHTMLElementCollection
Dim HTMLRows As MSHTML.IHTMLElementCollection
Dim HTMLDate As MSHTML.IHTMLElementCollection
Dim HTMLElem As MSHTML.IHTMLElement
Dim connectionTest As Boolean
Dim EUR, CZK, HRK, HUF, PLN, RON, RSD As String
Dim myURL As String
Dim i As Long
connectionTest = True
myURL = "https://www.nbs.rs/export/sites/default/internet/english/scripts/kl_srednji.html"
Call WebConnection(tempHTMLDoc, connectionTest, myURL)
If connectionTest = False Then Exit Sub
Set HTMLDate = tempHTMLDoc.getElementsByTagName("span")
'Debug.Print HTMLDate.Length
For Each HTMLElem In HTMLDate 'I am looking for which element contains the date (can not find)
Debug.Print HTMLElem.innerText
Next HTMLElem
'I am trying to get the necessary currencies
Set HTMLRows = tempHTMLDoc.getElementsByTagName("tr")
Debug.Print HTMLRows.Length
For i = 0 To HTMLRows.Length - 1 'If lenght > 0
Set HTMLCurrency = HTMLRows(i).getElementsByTagName("td")
If HTMLCurrency.Length > 4 Then 'each currency contains 5 "td" tags
Select Case HTMLCurrency(2).innerText
Case "EUR"
EUR = HTMLCurrency(4).innerText
Case "HRK"
HRK = HTMLCurrency(4).innerText
Case "HUF"
HUF = HTMLCurrency(4).innerText
Case "PLN"
PLN = HTMLCurrency(4).innerText
Case "RON"
RON = HTMLCurrency(4).innerText
Case "CZK"
CZK = HTMLCurrency(4).innerText
End Select
End If
Next i
Debug.Print "EUR - ", EUR; vbNewLine; "HRK - ", HRK; vbNewLine; "HUF - ", HUF; vbNewLine; "PLN - ", PLN; vbNewLine; _
"RON - ", RON; vbNewLine; "CZK - ", CZK
End Sub
'============================================================================
Sub WebConnection(HTMLDoc As MSHTML.HTMLDocument, ConnTest As Boolean, URL As String)
Dim XMLPage As New MSXML2.XMLHTTP60
Dim errorMsg As VbMsgBoxResult
On Error GoTo CONNECTION_ERROR
XMLPage.Open "GET", URL, False
XMLPage.send
On Error GoTo 0
If XMLPage.Status <> 200 Then
errorMsg = MsgBox("There is something wrong with webpage. Do you want to try to continue?", vbYesNo + vbCritical, "ERROR")
If errorMsg = vbNo Then
ConnTest = False
Exit Sub
End If
End If
HTMLDoc.body.innerHTML = XMLPage.responseText
Exit Sub
CONNECTION_ERROR:
MsgBox "There is something wrong with the connection.", vbCritical, "ERROR"
ConnTest = False
Exit Sub
End Sub
I tried to use id (index:srednjiKursList:tbody_element) or class name(tableCell) but it doesn't work. This website is built in a different way

Your original link, let's call it the landing page, is dynamically loaded. Your GET request is too quick to retrieve the required info.
There is an alternative URL you can use.
When you go to the landing page you show it actually issues an XMLHTTP GET request to the following page:
The above is from using fiddler but you could inspect the web traffic with, for example, Chrome dev tools (F12).
You can input that URL directly into your code and it works perfectly.
Whole table:
You can also grab the whole table as follows:
Option Explicit
Public Sub GetInfo()
Dim html As New HTMLDocument, hTable As HTMLTable, clipboard As Object
Const URL = "https://www.nbs.rs/kursnaListaModul/srednjiKurs.faces?lang=eng"
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
Set hTable = html.getElementById("index:srednjiKursLista")
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).PasteSpecial
End Sub
Sample of results:
Just the listed currencies:
You could also use a little maths, based on table structure, to get just those elements you listed.
Option Explicit
Public Sub GetInfo()
Dim html As New HTMLDocument, hTable As HTMLTable, clipboard As Object
Const URL = "https://www.nbs.rs/kursnaListaModul/srednjiKurs.faces?lang=eng"
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
Set hTable = html.getElementById("index:srednjiKursLista")
Dim list As Object, i As Long
Dim EUR As Double, CZK As Double, HRK As Double, HUF As Double, PLN As Double, RON As Double, RSD As Double
Set list = hTable.querySelectorAll("td")
For i = 2 To list.Length - 1 Step 5
Select Case list.item(i).innerText
Case "EUR"
EUR = list.item(i + 2).innerText
Case "HRK"
HRK = list.item(i + 2).innerText
Case "HUF"
HUF = list.item(i + 2).innerText
Case "PLN"
PLN = list.item(i + 2).innerText
Case "RON"
RON = list.item(i + 2).innerText
Case "CZK"
CZK = list.item(i + 2).innerText
End Select
Next
Debug.Print "EUR - ", EUR; vbNewLine; "HRK - ", HRK; vbNewLine; "HUF - ", HUF; vbNewLine; "PLN - ", PLN; vbNewLine; _
"RON - ", RON; vbNewLine; "CZK - ", CZK
End Sub
Using the clipboard:
The following line:
GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
adds a late bound reference to Microsoft Forms Object Library so you can access the clipboard.
You could also either add a userform to your project or go VBE > Tools > references > Microsoft Forms Object Library to have access:

Related

Historical Yahoo Finance API On Fritz Again?

I've been successfully using the Yahoo Finance API cookie/crumb technique to get historical stock quotes for many years. On April 28th, 2022, it broke, and excessive exercise of vocabulary has failed to fix it.
The technique is to lookup an illegible stock symbol, because part of the returned cookie can be used to obtain real historical data on successive uses of the historical Yahoo API. I've tried it with illegible and legitimate stock symbols. I'm using Excel 2019 VBA, and the debug seems to hang on the ".waitForResponse (10)" instruction. It's part of a loop, and it hangs on the first instance. The code is shown below. It successfully writes a zero on ws1.S10, but it fails to do anything after the "Next cook" instruction. Did Yahoo intentionally break the Finance API again, or did Microsoft "improve" Excel? Or, more likely, did I do something stump stupid, like turn on the computer? Thanks!
Sub HistUp()
Dim resultFromYahoo, csv_rows() As String
Dim objRequest
Dim resultArray As Variant
Dim eagle, nColumns, cook, iRows, iCols As Integer
Dim CSV_Fields As Variant
Dim ticker, tickerURL, cookie, crumb As String
Dim HistQuote, HistDiv, DefaultKey As String
Dim Curr, StartPer As String
Dim fox, sheep, bear, elk, wolf, raccoon, snake As Integer
Dim julian, ricky, bubbles As Double
Dim crumbStartPos, crumbEndPos, Lastrow1, Lastrow2 As Long
Set wb = ThisWorkbook
Set ws1 = wb.Worksheets(1)
Set ws2 = wb.Worksheets(2)
Set ws3 = wb.Worksheets(3)
Set ws4 = wb.Worksheets(4)
Set ws5 = wb.Worksheets(5)
Application.EnableEvents = False
Application.DisplayAlerts = False
eagle = ActiveSheet.Index
wb.Worksheets("Warn").Select
wb.Worksheets("Warn").Range("A1").Select
DoEvents
'getCookieCrumb
For cook = 0 To 5 'ask for a valid crumb 6 times
ws1.Range("S10") = cook
Set objRequest = CreateObject("WinHttp.WinHttpRequest.5.1")
With objRequest
.Open "GET", "https://finance.yahoo.com/lookup?s=turpitude", False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
.send
.waitForResponse (10)
cookie = Split(.getResponseHeader("Set-Cookie"), ";")(0)
crumbStartPos = InStrRev(.ResponseText, """crumb"":""") + 9
crumbEndPos = crumbStartPos + 11
crumb = Mid(.ResponseText, crumbStartPos, crumbEndPos - crumbStartPos)
End With
If Len(crumb) = 11 Then 'a valid crumb is 11 characters long
Exit For
End If
Next cook

I have the same problem. Looks like Yahoo changed the Yahoo finance API.
When I comment out the line:
cookie = Split(.getResponseHeader("Set-Cookie"), ";")(0)
the code seems to work as before.
Then also comment out
'.setRequestHeader "Cookie", Cookie
when you send the request. Rest of my code:
'CONSTRUCT THE URL:
'interval=1d or 1wk or 1mo, events=dividends or events=history (prices) or events=splits
WebRequestURL = "https://query1.finance.yahoo.com/v7/finance/download/" & StockSymbol & _
"?period1=" & UnixStartDate & "&period2=" & UnixEndDate & _
"&interval=" & UrlInterval & "&events=" & UrlEvents & "&crumb=" & Crumb
'FETCH THE DATA:
With WebRequest
.Open "GET", WebRequestURL, False
'.setRequestHeader "Cookie", Cookie
.Send
.waitForResponse (10)
End With

As people have commented, the "Set-Cookie" header is no longer in the response, however the crumbstore is still there, so I would recommend checking to see if the header exists, and if not continue without setting that header or including the crumb.
Here is more robust code that allows you to select if you want price history, dividends, or split. You can also select a date range, by converting the dates to POSIX. It can also accommodate a proxy server. Feel free to comment on any improvements I can make to this code.
'New Yahoo Data Method
'sEvent: history, div, split
'sInterval: 1d,1wk,1mo
Public Function GetNewYahooData(sSymbol As String, sStart As String, sEnd As String, sEvent As String, sInterval As String, sProxy As String) As String
On Error GoTo Error_Message
Const sTestURL = "https://query1.finance.yahoo.com/v7/finance/download/"
Const sBaseURL = "https://finance.yahoo.com/quote/^GSPC"
'This assumes the crumb appears like this: "CrumbStore":{"crumb":"taEvjA8DFqs"}
Const sCrumbStart = """CrumbStore"":{""crumb"":"""
Const sCrumbEnd = """"
Const sTickerReplace = "TTTT"
Const sPeriod1Replace = "pppppppp"
Const sPeriod2Replace = "qqqqqqqq"
Const sEventReplace = "eeeeeeee"
Const sCrumbReplace = "cccccccc"
Const sIntervalReplace = "iiiiiiii"
Dim sReturn As String
Dim sTemURL As String
Dim sCookie As String
Dim sCrumb As String
Dim dtStart As Date
Dim dtEnd As Date
Dim lngCrumbStart As Long
Dim lngCrumbEnd As Long
Dim objRequest As WinHttp.WinHttpRequest
Dim sContentType As String
Dim bolHaveCrumb As Boolean
dtStart = CDate(sStart)
dtEnd = CDate(sEnd)
bolHaveCrumb = False
'Perform a Yahoo financial lookup on SP500 to get the crumb
Set objRequest = New WinHttp.WinHttpRequest
With objRequest
If Len(sProxy) > 0 Then .SetProxy 2, sProxy, ""
.Open "GET", sBaseURL, False
.SetRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
.sEnd
.WaitForResponse (10)
sReturn = .ResponseText
If InStr(1, sReturn, sCrumbStart) > 0 Then
lngCrumbStart = InStr(1, sReturn, sCrumbStart) + 23
lngCrumbEnd = InStr(lngCrumbStart, sReturn, sCrumbEnd)
sCrumb = Mid(sReturn, lngCrumbStart, lngCrumbEnd - lngCrumbStart)
sCookie = .GetAllResponseHeaders
If InStr(sCookie, "Set-Cookie") > 0 Then
sCookie = Split(.GetResponseHeader("Set-Cookie"), ";")(0)
bolHaveCrumb = True
Else
bolHaveCrumb = False
End If
End If
End With
'This is currently https://query1.finance.yahoo.com/v7/finance/download/TTTT?period1=pppppppp&period2=qqqqqqqq&interval=iiiiiiii&events=eeeeeeee&crumb=cccccccc
sTemURL = gYAHOO_HIS_URL
sTemURL = Replace(sTemURL, sIntervalReplace, sInterval)
sTemURL = Replace(sTemURL, sTickerReplace, sSymbol)
sTemURL = Replace(sTemURL, sPeriod1Replace, toPOSIX(dtStart))
sTemURL = Replace(sTemURL, sPeriod2Replace, toPOSIX(dtEnd))
sTemURL = Replace(sTemURL, sEventReplace, sEvent)
If bolHaveCrumb Then
sTemURL = Replace(sTemURL, sCrumbReplace, sCrumb)
Else
sTemURL = Replace(sTemURL, "&" & sCrumbReplace, "")
End If
Set objRequest = New WinHttp.WinHttpRequest
With objRequest
If Len(sProxy) > 0 Then .SetProxy 2, sProxy, ""
.Open "GET", sTemURL, False
If bolHaveCrumb Then .SetRequestHeader "Cookie", sCookie
.sEnd
.WaitForResponse (10)
sContentType = .GetResponseHeader("Content-Type")
sReturn = StrConv(.ResponseText, vbUnicode)
sReturn = StrConv(sReturn, vbFromUnicode)
End With
If Len(sReturn) > 0 Then
GetNewYahooData = sReturn
Else
GetNewYahooData = ""
End If
Exit Function
Error_Message:
MsgBox err.Description, vbCritical, "Yahoo Price Retrieval"
GetNewYahooData = ""
End Function
'Helper function to convert a date into its POSIX representation
Public Function toPOSIX(dt As Date) As Long
On Error GoTo err
toPOSIX = DateDiff("s", "1/1/1970", dt)
Exit Function
err:
MsgBox err.Description, vbOKOnly, "toPOSIX"
End Function
Also sometimes the first time you make the request, it might not return anything, so I usually try 5 times before giving up.
'Sometimes we won't get a response on the first try, so try 5 times
While Not YahooEnd
gHTMLFileHis = GetNewYahooData(txtSymbol.Text, txtStartDate.Text, txtEndDate.Text, sEvent, sInterval, txtproxy.Text)
i = i + 1
If Len(gHTMLFileHis) = 0 Then
YahooSuccess = False
Else
arrRows = Split(gHTMLFileHis, vbLf)
arrRow = Split(arrRows(0), ",")
If arrRow(0) = StrConv("Date", vbUnicode) Or arrRow(0) = "Date" Then
YahooSuccess = True
YahooEnd = True
End If
If i = 5 Then YahooEnd = True
End If
Wend

Scrape economic data from investing website

I am working on a code to get data from : https://www.investing.com/economic-calendar/core-durable-goods-orders-59
I have got the code for getting this via httprequest: but looking to change this to work for the economic data (link above) is there any way I can get the same for the economic indicators??
code below:
Option Explicit
Sub Export_Table()
'Html Objects---------------------------------------'
Dim htmlDoc As MSHTML.HTMLDocument
Dim htmlBody As MSHTML.htmlBody
Dim ieTable As MSHTML.HTMLTable
Dim Element As MSHTML.HTMLElementCollection
'Workbooks, Worksheets, Ranges, LastRow, Incrementers ----------------'
Dim wb As Workbook
Dim Table As Worksheet
Dim i As Long
Set wb = ThisWorkbook
Set Table = wb.Worksheets("Sheet1")
'-------------------------------------------'
Dim xmlHttpRequest As New MSXML2.XMLHTTP60 '
'-------------------------------------------'
i = 2
'Web Request --------------------------------------------------------------------------'
With xmlHttpRequest
.Open "POST", "https://www.investing.com/instruments/HistoricalDataAjax", False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
.setRequestHeader "X-Requested-With", "XMLHttpRequest"
.send "curr_id=951681&smlID=1695217&header=CLNX+Historical+Data&st_date=01%2F01%2F2017&end_date=03%2F01%2F2019&interval_sec=Monthly&sort_col=date&sort_ord=DESC&action=historical_data"
If .Status = 200 Then
Set htmlDoc = CreateHTMLDoc
Set htmlBody = htmlDoc.body
htmlBody.innerHTML = xmlHttpRequest.responseText
Set ieTable = htmlDoc.getElementById("curr_table")
For Each Element In ieTable.getElementsByTagName("tr")
Table.Cells(i, 1) = Element.Children(0).innerText
Table.Cells(i, 2) = Element.Children(1).innerText
Table.Cells(i, 3) = Element.Children(2).innerText
Table.Cells(i, 4) = Element.Children(3).innerText
Table.Cells(i, 5) = Element.Children(4).innerText
Table.Cells(i, 6) = Element.Children(5).innerText
Table.Cells(i, 7) = Element.Children(6).innerText
i = i + 1
DoEvents: Next Element
End If
End With
Set xmlHttpRequest = Nothing
Set htmlDoc = Nothing
Set htmlBody = Nothing
Set ieTable = Nothing
Set Element = Nothing
End Sub
Public Function CreateHTMLDoc() As MSHTML.HTMLDocument
Set CreateHTMLDoc = CreateObject("htmlfile")
End Function

I have used the excel tool Power Query for this very thing. It is also called Get & Transform Data. I don't like using vba for doing this type of thing.
To make it work:
In Excel Go to Data>Get Data>From Other Sources>From Web.
Enter the URL
Wait for the webpage to load and then pick the table that you want.
This website took awhile to load, but it did work for me.
Choose "Load" which goes directly to the sheet, or "Transform Data" to manipulate the data in Power Query. There are many options in power query such as split columns, filter data, Calculate Columns and ...

I would avoid the overhead of setting up a permanent connection and simply continue using XHR. With the data > from web, you cannot grab more rows than are present on the initial landing. If however you go with XHR, you can issue POST requests to get more data. The code below utilizes a loop to retrieve additional results beyond the immediate visible on the page.
When you press the Show more link there is a POST request for an additional 6 rows which uses the latest date from the current set of results as part of the POST body. The response returned is JSON. Rather than bringing in a JSON parser, given the standard nature of the JSON, and that I am already using regex to clean the date format in column 1 to put in the POST body, I use two simple regexes to extract the html for the next results table from the response, and to check whether there are more results.
The format of the JSON is:
{
"historyRows": "<tr>…..</tr>",
"hasMoreHistory": "1"
}
Or
{
"historyRows": "<tr>…..</tr>",
"hasMoreHistory": false
}
So, I do some cleaning of the extracted html in order to not confuse the html parser within MSHTML. Furthermore, I add in an id to identify the table I have constructed, so I can continue to use an id css selector (#) list within my UpdateDateResults function.
I initially oversize an array to store each retrieved table which I update ByRef. I loop requesting more results until either there are no more results, there is an error parsing the maximum date from the last retrieved table column 1, or until my specified earliest date for data retrieval falls within the date range of the latest returned table.
Finally, I write the results array out to the sheet in one go.
N.B. You can target the table by its id. It looks like the number at the end of the id could be the same as for the goods url, lending itself to generalizing the code below to work for other goods.
VBA:
Option Explicit
Public Sub GetInvestingInfo()
'tools > references > Microsoft HTML Object Library
Dim html As MSHTML.HTMLDocument, xhr As Object
Const STARTDATE As Date = "2019-11-25" 'Adjust as required. DateAdd("yyyy", -2, Date) 2 years back. This means may have some earlier months in _
batch that spans the start date but won't issue an additional request after this
Set xhr = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
With xhr
.Open "GET", "https://www.investing.com/economic-calendar/core-durable-goods-orders-59", False
.setRequestHeader "User-Agent", "Safari/537.36"
.send
html.body.innerHTML = .responseText
End With
Dim firstTable As Boolean, r As Long, results() As Variant
ReDim results(1 To 100000, 1 To 5)
'process initial table and update results, get cleaned date needed for request for more results
firstTable = True
Dim latestDate As String
UpdateDateResults latestDate, results, firstTable, r, html
Dim re As Object, maxDate As String, hasMoreHistory As Boolean, s As String
Set re = CreateObject("VBScript.RegExp")
With re
.Global = True
.MultiLine = False
End With
maxDate = cleanedDate(latestDate, re)
hasMoreHistory = True
Dim errorDate As Date
errorDate = DateAdd("d", 1, Date)
Do While maxDate >= STARTDATE And maxDate < errorDate 'break loop using pre-defined earliest date, error with date conversion, or when no more rows found
Application.Wait (Now + TimeSerial(0, 0, 1)) 'Pause
s = GetMoreRows(xhr, Format$(maxDate, "YYYY-MM-DD")) 'max a POST request for more data
re.Pattern = "hasMoreHistory"":(""?.*?""?)}" 'Check if there are more rows still available. "1" for yes, false for no
hasMoreHistory = (re.Execute(s)(0).submatches(0) <> False)
If Not hasMoreHistory Then Exit Do
re.Pattern = "historyRows"":""(.*)"","
html.body.innerHTML = "<table id=""me"">" & Replace$(re.Execute(s)(0).submatches(0), "\/", "/") & "</table>" 'fix html and feed into html variable
UpdateDateResults latestDate, results, firstTable, r, html
maxDate = cleanedDate(latestDate, re) 'convert value retrieved from last row in date column of table to an actual date
Loop
With ActiveSheet
.Cells.ClearContents
.Cells(1, 1).Resize(r, 5) = results 'Don't bother to resize results as clear all cells before write ou
End With
End Sub
Public Sub UpdateDateResults(ByRef latestDate As String, ByRef results() As Variant, ByRef firstTable As Boolean, ByRef r As Long, ByVal html As MSHTML.HTMLDocument)
Dim table As MSHTML.HTMLTable 'return latest date from function
Set table = html.querySelector("#eventHistoryTable59, #me")
latestDate = table.Rows(table.Rows.Length - 1).Children(0).innerText
Dim i As Long, n As Long, j As Long
n = IIf(firstTable, 0, 1)
For i = n To table.Rows.Length - 1
r = r + 1
For j = 0 To table.Rows(i).Children.Length - 2
results(r, j + 1) = table.Rows(i).Children(j).innerText
Next
Next
firstTable = False
End Sub
Public Function cleanedDate(ByVal dirtyString As String, ByVal re As Object) As Date
re.Pattern = "(^[A-Z][a-z]{2}).*(\d{2}),.(\d{4})(.*)"
On Error GoTo errhand:
If re.test(dirtyString) Then
cleanedDate = CDate(re.Replace(dirtyString, "$2" & Chr$(32) & "$1" & Chr$(32) & "$3"))
Exit Function
End If
errhand:
cleanedDate = DateAdd("d", 1, Date)
End Function
Public Function GetMoreRows(ByVal xhr As Object, ByVal dateStamp As String) As String
With xhr
.Open "POST", "https://www.investing.com/economic-calendar/more-history", False
.setRequestHeader "User-Agent", "Safari/537.36"
.setRequestHeader "x-requested-with", "XMLHttpRequest"
.setRequestHeader "content-type", "application/x-www-form-urlencoded"
.send "eventID=430865&event_attr_ID=59&event_timestamp=" & dateStamp & "+" & Application.WorksheetFunction.EncodeURL("12:30:00") & "&is_speech=0"
GetMoreRows = .responseText
End With
End Function
Regexes (without the double " escaping for VBA):
hasMoreHistory":("?.*?"?)}
historyRows":"(.*)",

How to skip a row in Excel with missing html tag using VBA

There are 15 objects listed on this website, each has a link under the photo. The 6th object has none. When extracting and transferring the content with my code the missing html-href is not skipped and in Excel, 14 hrefs are listed below each other (the 6th cell should remain empty or "no ducument"), but the last cell does (& error because 14<=>15). Unfortunately I have to keep my code structure and just need a loop or condition to complete it. Does anyone have any ideas? Thanks.
My incomplete code:
Public Sub GetData()
Dim html As New HTMLDocument
Dim elmt01 As Object, elmt02 As Object
Dim y As Long
Dim xURL As String
Set html = New MSHTML.HTMLDocument
xURL = "https://immobilienpool.de/suche/immobilien?page=1"
With CreateObject("MSXML2.XMLHTTP.6.0")
.Open "GET", xURL, False
.send
html.body.innerHTML = .responseText
End With
Set elmt01 = html.querySelectorAll("li[class*='contentBox']") '15 items
Set elmt02 = html.querySelectorAll("li a[title*='zusätzliche']") '14 hrefs
For y = 0 To elmt01.Length - 1
If InStr(elmt02, "pdf") Then 'better: If elmt02 exists in elmt01 then...
ActiveSheet.Cells(y + 1, 2) = elmt02.Item(y).href
Else
ActiveSheet.Cells(y + 1, 2) = "No document"
End If
Next
End Sub

The following script should solve the issue you are having. I had to modify your code to skip the blank row. I hope you will be able to comply with the current version:
Public Sub GetData()
Dim Html As HTMLDocument, HTMLDoc As HTMLDocument
Dim oPdfLink As Object, xURL As String, I As Long
Set Html = New MSHTML.HTMLDocument
Set HTMLDoc = New MSHTML.HTMLDocument
xURL = "https://immobilienpool.de/suche/immobilien?page=1"
With CreateObject("MSXML2.XMLHTTP.6.0")
.Open "GET", xURL, False
.send
Html.body.innerHTML = .responseText
End With
With Html.querySelectorAll("li[class*='contentBox']")
For I = 0 To .Length - 1
HTMLDoc.body.innerHTML = .item(I).outerHTML
Set oPdfLink = HTMLDoc.querySelector("a[title*='zusätzliche']")
If Not oPdfLink Is Nothing Then
ActiveSheet.Cells(I + 1, 2) = oPdfLink.href
Else:
ActiveSheet.Cells(I + 1, 2) = "No document"
End If
Next I
End With
End Sub

Getting elements using multiple tags

I wish to pull data into excel from the following link: https://echa.europa.eu/registration-dossier/-/registered-dossier/13817/7/1 for Tox summaries for inhalation routes, dermal, eyes etc
The code below partly achieves this
Public Sub GetContents3()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/13817/7/1", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
Set SubSectList = HTMLDoc.getElementById("SectionContent")
Debug.Print SubSectList.innerText
End Sub
Although the code above works okay there are a few issues Id like to address.
For some reason the first Workers - Hazard via inhalation route is not being pulled.
I wish to use the tags dt and dd to refine how the information should be pulled. As an example of how I would like the data to be printed:
Workers - Hazard via inhalation route
Systemic effects
Long term exposure
Hazard assessment conclusion: DNEL (Derived No Effect Level)
Value: 1.41 mg/m³
... Etc (to include all data)
Workers - Hazard via dermal route
Systemic effects
Long term exposure
Hazard assessment conclusion: DNEL (Derived No Effect Level)
Value: 2.06 mg/kg bw/day
And So on for each Route.
To try and Achieve this I have the following code:
Public Sub GetContents3()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/13817/7/1", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
Set SubSectList = HTMLDoc.getElementById("SectionContent")
Set SubSects = SubSectList.getElementsByTagName("dt")
For Each SubSect In SubSects
Debug.Print SubSect.innerText & " " & SubSect.NextSibling.innerText
Next SubSect
End Sub
This is better but now it doesn't pull the subheadings.
I need someway to pull the information from multiple tags without overlapping data however I am unsure how to do this and so far attempts have just caused errors.
If anyone knows how I can modify the code to achieve the desired output above that would be great.

You can specify a css selector pattern to match the relevant tags, then during a loop over returned nodes, check the tagName, if DD or DT you need to combine into a single line for output:
Option Explicit
Public Sub GetContents()
Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
Set http = New MSXML2.XMLHTTP60: Set html = New MSHTML.HTMLDocument
With http
.Open "GET", "https://echa.europa.eu/registration-dossier/-/registered-dossier/13817/7/1", False
.send
html.body.innerHTML = .responseText
End With
Dim nodeList As MSHTML.IHTMLDOMChildrenCollection, i As Long, r As Long, concat As String
Set nodeList = html.querySelectorAll("#SectionContent h3,#SectionContent h4,#SectionContent h5,#SectionContent h6, #SectionContent dt, #SectionContent dd")
For i = 1 To nodeList.Length - 1
Select Case nodeList.Item(i).tagName
Case "DT"
concat = nodeList.Item(i).innerText
Case "DD"
concat = concat & Chr$(32) & nodeList.Item(i).innerText
r = r + 1
ActiveSheet.Cells(r, 1) = concat
Case Else
concat = vbNullString
r = r + 1
ActiveSheet.Cells(r, 1) = nodeList.Item(i).innerText
End Select
Next
End Sub
Now, I would like something tidier than the requested list, so how about the following tabulation?
Option Explicit
Public Sub GetContents()
Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
Set http = New MSXML2.XMLHTTP60: Set html = New MSHTML.HTMLDocument
With http
.Open "GET", "https://echa.europa.eu/registration-dossier/-/registered-dossier/13817/7/1", False
.send
html.body.innerHTML = .responseText
End With
Dim nodeList As MSHTML.IHTMLDOMChildrenCollection, i As Long
Dim c As Long, r As Long, tag As String
Set nodeList = html.querySelectorAll("#SectionContent h3,#SectionContent h4,#SectionContent h5,#SectionContent h6, #SectionContent dt, #SectionContent dd")
r = 1
For i = 1 To nodeList.Length - 1
tag = nodeList.Item(i).tagName
Select Case tag
Case "DT"
c = 5
Case "DD"
c = 6
r = r + 1
Case Else
c = Right$(tag, 1) - 2
End Select
ActiveSheet.Cells(IIf(tag = "DD", r - 1, r), c) = nodeList.Item(i).innerText
Next
End Sub

Web Scraping ETFs Daily Data VBA

I'm trying to web scrape some daily info of differents ETFs. I found that https://www.marketwatch.com/ have a accurate info.
The most relevant info is the open Price, outstanding shares, NAV, total assets of the ETF.
Here is the link for IVV US Equity: https://www.marketwatch.com/investing/fund/ivv
I have web scraped with VBA before but the HTML of the pages I had used are different, I don't know if this is because some values of the ETFs (such as Price and Taded Volume) change constantly.
The idea is to create a code to extract relevant info and create a data base to analyze Macroeconomics factor using the ETFs as market indicators of flows between countries, regions, etc...
Mi first approach would be with VBA but after I get more into the data I would like to try with Python (after I get more conffident with it) to automate the webscraping process on a daily basis.
I am open to any suggestion or any other website that could be useful (I have tried with Yahoo Finance and Morningstar and I get the same problema with the HTML code).
This is my poor code:
Sub Get_Data()
Dim ticker As String, enlace As String
ticker = ThisWorkbook.Worksheets("ETFs").Cells(2, 2).Value 'IVV
'link = "https://www.morningstar.com/etfs/arcx/" & ticker & "/quote.html"
'link = "https://finance.yahoo.com/quote/" & ticker & "?p=" & ticker
link = "https://www.marketwatch.com/investing/fund/" & ticker
Application.ScreenUpdating = False
Dim x As Integer
x = ThisWorkbook.Worksheets("ETFs").Cells(Rows.Count, 1).End(xlUp).Row
'Dim i As Integer
'For i = 2 To x
Dim total_net_assets As Variant, open_price As Variant, NAV As Variant, shares_out
Set ie = CreateObject("InternetExplorer.application")
With ie
.Visible = False
.navigate link
While .Busy Or .readyState < 4: DoEvents: Wend
Do
DoEvents
On Error Resume Next
' Here is where I get the problem of not knowing how to reference the values I need because the class name appears repeatedly
total_net_assets = .document.getElementsByClassName("").Value
open_price = .document.getElementByClassName("price").Value
NAV = .document.getElementByClassName("").Value
shares_out = .document.getElementByClassName("kv__value kv__primary ").Value
On Error GoTo 0
Loop
End With
ThisWorkbook.Worksheets("ETFs").Cells(2, 13).Value = total_net_assets
ThisWorkbook.Worksheets("ETFs").Cells(2, 14).Value = NAV
ThisWorkbook.Worksheets("ETFs").Cells(2, 15).Value = open_price
ThisWorkbook.Worksheets("ETFs").Cells(2, 16).Value = shares_out
ie.Quit
'Next i
Application.ScreenUpdating = True
End Sub

Access method:
I use XMLHTTP requests as much faster than opening IE.
Code notes:
The following reads in fund short codes from Sheet1 column A, starting in A2, into an array. You can easily extend this adding more funds into column A.
This array is looped issuing XMLHTTP requests by concatenating the fund code into the BASE_URL variable.
I use a class, clsHTTP, to hold the XMLHTTP object to be efficient - no need to keep creating and destroying the object.
I provide this class with two methods. One to retrieve the target page innerHTML (GetString), and the other to extract the required info if available (GetInfo). I use a dictionary to test if the searched for labels are present. If present I grab the associated value. If not, I have a placeholder vbNullString in the dictionary.
I add each scraped result into a collection called results. At the end I loop this writing out to the sheet. By keeping most of the work in memory this provides for much faster scraping.
Retrieving info from HTML:
The labels e.g. Open, and values come in pairs.
You can generate a nodeList (think collection as with getElementsByClassName) by using querySelectorAll method to apply a class CSS selector to gather the label elements by their class name kv__label. The "." is the class selector.
Set labels = .querySelectorAll(".kv__label") '<== nodeList of labels
You do the same to get the associated values:
Set values = .querySelectorAll(".kv__value.kv__primary") '<== nodeList of associated values. Same length as labels nodeList so can use same index to retrieve associated label/value pairs from each nodeList.
You loop the labels using the dictionary in the clsHTTP method .GetInfo to see if you searched for labels are present, if they are, the associated value is retrieved from values by using the same index as where the label was found in the nodeList labels, and the dictionary vbNullString value for that label is updated with the actual retrieved value, else it is left as vbNullString.
Sample results:
VBA:
Class module clsHTTP:
Option Explicit
Private http As Object
Private Sub Class_Initialize()
Set http = CreateObject("MSXML2.XMLHTTP")
End Sub
Public Function GetString(ByVal url As String) As String
Dim sResponse As String
With http
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
GetString = sResponse
End With
End Function
Public Function GetInfo(ByVal html As HTMLDocument) As Object
Dim dict As Object, i As Long
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "Open", vbNullString
dict.Add "Shares Outstanding", vbNullString
dict.Add "Total Net Assets", vbNullString
dict.Add "NAV", vbNullString
Dim values As Object, labels As Object
With html
Set values = .querySelectorAll(".kv__value.kv__primary")
Set labels = .querySelectorAll(".kv__label")
For i = 0 To labels.Length - 1
If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
Next
End With
Set GetInfo = dict
End Function
Standard module 1:
Option Explicit
Public Sub GetFundInfo()
Dim sResponse As String, html As HTMLDocument, http As clsHTTP, i As Long
Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
Application.ScreenUpdating = False
headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
Set results = New Collection
Set http = New clsHTTP
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
For i = LBound(funds) To UBound(funds)
If Not IsEmpty(funds(i)) Then
url = BASE_URL & funds(i)
html.body.innerHTML = http.GetString(url)
results.Add http.GetInfo(html).Items
End If
Next
If results.Count > 0 Then
Dim item As Variant, r As Long, c As Long
r = 2: c = 2
With ws
.Cells(1, c).Resize(1, UBound(headers) + 1) = headers
For Each item In results
.Cells(r, c).Resize(1, UBound(item) + 1) = item
r = r + 1
Next
End With
End If
Application.ScreenUpdating = True
End Sub
Set-up:
Without using a class:
Option Explicit
Public Sub GetFundInfo()
Dim sResponse As String, html As HTMLDocument, i As Long
Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
Application.ScreenUpdating = False
headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
Set results = New Collection
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
For i = LBound(funds) To UBound(funds)
If Not IsEmpty(funds(i)) Then
url = BASE_URL & funds(i)
html.body.innerHTML = GetString(url)
results.Add GetInfo(html).Items
End If
Next
If results.Count > 0 Then
Dim item As Variant, r As Long, c As Long
r = 2: c = 2
With ws
.Cells(1, c).Resize(1, UBound(headers) + 1) = headers
For Each item In results
.Cells(r, c).Resize(1, UBound(item) + 1) = item
r = r + 1
Next
End With
End If
Application.ScreenUpdating = True
End Sub
Public Function GetString(ByVal url As String) As String
Dim http As Object
Set http = CreateObject("MSXML2.XMLHTTP")
Dim sResponse As String
With http
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
GetString = sResponse
End With
End Function
Public Function GetInfo(ByVal html As HTMLDocument) As Object
Dim dict As Object, i As Long
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "Open", vbNullString
dict.Add "Shares Outstanding", vbNullString
dict.Add "Total Net Assets", vbNullString
dict.Add "NAV", vbNullString
Dim values As Object, labels As Object
With html
Set values = .querySelectorAll(".kv__value.kv__primary")
Set labels = .querySelectorAll(".kv__label")
For i = 0 To labels.Length - 1
If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
Next
End With
Set GetInfo = dict
End Function

Okay, so you will need to create two loops. You can just keep reusing the elem0, elem1, and elemColl(1) variables for each price point you need - just make sure to reset bFoundIt to False for each new iteration so you do not exit the For Loops early.
For your total_net_assets var, you will first loop the class of kv__item. You will then need to loop each class collection of kv__label within the kv__item's elements and stop when you match the innerText: Total Net Assets. Once you match this, you will use the first coll obj elem0 to get the kv__value kv__primary class name for it.
Dim IE As Object, elem0 As Object, elem1 As Object, i As Long, bFoundIt As Boolean
Set IE = CreateObject("InternetExplorer.application")
With IE
.Visible = False
.navigate link
While .Busy Or .readyState < 4: DoEvents: Wend
DoEvents
bFoundIt = False
For Each elem0 In .document.getElementsByClassName("kv__item")
For Each elem1 In elem0.getElementsByClassName("kv__label")
If elem1.innerText = "Total Net Assets" Then
bFoundIt = True
total_net_assets = elem0.getElementsByClassName("kv__value kv__primary ")(0).innerText
Exit For
End If
Next elem1
If bFoundIt Then Exit For
Next elem0

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

VBA - web scraping can not get HTMLElement innerText - excel

Related

Historical Yahoo Finance API On Fritz Again?

Scrape economic data from investing website

How to skip a row in Excel with missing html tag using VBA

Getting elements using multiple tags

Web Scraping ETFs Daily Data VBA

Categories

Resources