How can I parse values from a dynamic webpage using Excel VBA when XML/XPath doesn't seem to work? - excel

I am attempting to scrape values from a collection of webpages, using an XPath to parse what I want out of the XML. I grab the full XPath of the element from Chrome, but when I use it in the code it doesn't seem to select the node I am looking for. When I execute the XPath statement in the console, it does not return the node either. Some other element XPaths work in the console but not in VBA. Am I missing something? My simple test XML works fine. My attempts to use a namespace in the XPath were also unsuccessful. Code below, with an example of one of the webpages and one of the elements of interest:
Sub test()
testXML = "<test example='hello'>hello</test>"
Dim oXMLHTTP As Object
Dim sPageHTML As String
Dim sURL As String
Dim XmlResponse As String
Dim strXML As String
Dim xNode As MSXML2.IXMLDOMNode
Dim xmlElement As MSXML2.IXMLDOMElement
Dim XDoc As MSXML2.DOMDocument60
sURL = "https://www.bestplaces.net/crime/zip-code/alaska/anchorage/99510"
Set oXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
oXMLHTTP.SetOption(2) = 13056 'Disable CA messages
oXMLHTTP.Open "GET", sURL, False
oXMLHTTP.send
XmlResponse = oXMLHTTP.responseText
'strXML = testXML
strXML = XmlResponse
Set XDoc = New MSXML2.DOMDocument60
'XDoc.setProperty "SelectionNamespaces", "xmlns:a='http://www.w3.org/1999/xhtml'"
'XDoc.setProperty "SelectionNamespaces", "xmlns:a='http://www.w3.org/2000/svg'"
'XDoc.setProperty "SelectionNamespaces", "xmlns:a='http://www.w3.org/1999/xlink'"
XDoc.LoadXML (strXML)
'Set xNode = XDoc.SelectSingleNode("/test")
Set xNode = XDoc.SelectSingleNode("/html/body/form/div[7]/div[2]/div[2]/div[3]/div/div/div/div/div/svg/g[6]/g[1]/text/tspan[2]")
If xNode Is Nothing Then
MsgBox "Nothing"
Else: MsgBox xNode.text
End If
End Sub

You are getting HTML back, not XML. A quick look at the page source shows that the value is populated dynamically, so your XPath wouldn't work even if it were converted to the equivalent path for an HTML parser. The value should, however, be available by regex out of responseText.
Option Explicit

Public Sub GetValue()
    Dim http As Object, s As String, re As Object, matches As Object
    Set http = CreateObject("MSXML2.XMLHTTP")
    Set re = CreateObject("VBScript.RegExp")
    With http
        .Open "GET", "https://www.bestplaces.net/crime/zip-code/alaska/anchorage/99510", False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText
    End With
    With re
        .Pattern = "data:\s?\[(.*?),"   ' capture the first item of the data array
        Set matches = .Execute(s)
        If matches.Count > 0 Then       ' avoid a runtime error when nothing matches
            Debug.Print matches(0).SubMatches(0)
        Else
            Debug.Print "Pattern not found"
        End If
    End With
End Sub
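Regex explanation: data: matches the literal text; \s? allows one optional whitespace character; \[ matches the opening bracket of the array; (.*?) lazily captures everything up to the first comma that follows, i.e. the first item of the array.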
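A quick way to check the first point above (the response is HTML rather than well-formed XML) is to look at what LoadXML reports. The following is only a minimal sketch: the function name XmlParseProblem and the two sample strings are made up for illustration, and it assumes MSXML 6.0 is installed (it is late bound, so no reference is needed).

```vba
Option Explicit

' Returns an empty string when the text parses as XML,
' otherwise the reason reported by the MSXML parser.
' (XmlParseProblem is a hypothetical helper name.)
Public Function XmlParseProblem(ByVal inputText As String) As String
    Dim doc As Object
    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False
    ' LoadXML returns False when the text is not well-formed XML
    If Not doc.LoadXML(inputText) Then
        XmlParseProblem = doc.parseError.reason
    End If
End Function

Public Sub DemoXmlParseProblem()
    ' The test XML from the question parses cleanly
    Debug.Print "XML : [" & XmlParseProblem("<test example='hello'>hello</test>") & "]"
    ' Typical HTML (unclosed <br> and <p>) does not, so SelectSingleNode has nothing to search
    Debug.Print "HTML: [" & XmlParseProblem("<html><body><br><p>hello</body></html>") & "]"
End Sub
```

If the reason is non-empty for the real responseText, the document was never loaded, which would explain why every XPath returns Nothing regardless of how it is written.") = ""
Debug.Assert Len(XmlParseProblem("<html><body><br><p>hello</body></html>")) > 0
</test>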

Related

How can we use http.Open "GET" to list items from a table in HTML?

I'm testing an idea that I had. It seems like I should be able to scrape out various HTML elements from a table in a website, but my code can't seem to find the table, which definitely seems to be there.
Sub TryThis()
Dim oHtml As HTMLDocument
Dim oElement As Object
Set oHtml = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population", False
.send
oHtml.body.innerHTML = .responseText
End With
Set myitem = oHtml.getElementsByClassName("wikitable sortable jquery-tablesorter")
i = 0
For Each oElement In myitem
Sheets("Sheet1").Range("A" & i + 1) = myitem(i).innerText
i = i + 1
Next oElement
End Sub
Essentially, I would like to loop through the HTML items and print, in cells, the contents of the table with the class 'wikitable sortable jquery-tablesorter'.
You were really close. I think the issue is that the jquery-tablesorter class is added by jQuery (or a plugin) via JavaScript after the page has loaded, so it isn't present in the HTML returned by the web request. Removing it from the search criteria should fix the issue.
Here's what I came up with to address this and also move the table contents over a bit quicker. I only handled the first element with the wikitable sortable classes, but it should be possible to loop over each table too.
Sub TryThis()
    Dim oHtml As HTMLDocument
    Dim oElement As Object
    Dim htmlText As String
    Set oHtml = New HTMLDocument
    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population", False
        .send
        oHtml.body.innerHTML = .responseText
    End With
    ' First table carrying the classes that exist before any JavaScript runs
    htmlText = oHtml.getElementsByClassName("wikitable sortable")(0).outerHTML
    With CreateObject("new:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}") 'Clipboard
        .SetText htmlText
        .PutInClipboard
        Sheets(1).Range("A1").Select
        Sheets(1).PasteSpecial Format:="Unicode Text"
    End With
End Sub

Parse line from HTML response in Excel VBA

I am looking to parse a 'WinHttpRequest' response in Excel VBA to pull a specific line from the HTML response. Here is the code for the HTTP request.
Function getSetName(ByVal setNUMBER As String) As String
Dim oRequest As Object
Dim xmlResponse As String
Dim setDescription As String
Dim setName As String
Set oRequest = CreateObject("WinHttp.WinHttpRequest.5.1")
oRequest.Open "GET", "https://brickset.com/sets/75192" '& setNUMBER
oRequest.Send
xmlResponse = oRequest.ResponseText
'parse xml response here
getSetName = setName
End Function
I am looking to parse only the line with the HTML tag <meta name="description"> into a string, which I will later pull information from.
Any help or direction on how to parse this single line would be appreciated.
Try:
Sub Test_GetSetName()
    Debug.Print GetSetName("75192")
End Sub

Function GetSetName(ByVal SetNUMBER As String) As String
    Dim html As New MSHTML.HTMLDocument
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", "https://brickset.com/sets/" & SetNUMBER, False
        .send
        html.body.innerHTML = .responseText
    End With
    ' Attribute selector: the element whose name attribute is "description"
    GetSetName = html.querySelector("[name='description']").Content
End Function

How to parse XML in VBA and retrieve specific values

I've already spent two weeks searching unsuccessfully for how to parse one specific XML and fetch just a few values. I tried every piece of code I could find on the internet until I found one that solved part of my problem.
The XML I'm trying to fetch is from the U.S. Department of Agriculture and is free to access.
https://apps.fas.usda.gov/psdonline/app/index.html#/app/about
Dim xmlDoc As MSXML2.DOMDocument60
Dim xmlNode As MSXML2.IXMLDOMNode
Dim xmlNodeList As MSXML2.IXMLDOMNodeList
Dim myNode As MSXML2.IXMLDOMNode
Dim URL As String, APIkey As String
APIkey = "8DB688F8-1E22-4031-B581-59C221ECDDA6"
URL = "https://apps.fas.usda.gov/PSDOnlineDataServices/api/CommodityData/GetCommodityDataByYear?commodityCode=2222000&marketYear=2018"
Set xmlDoc = New MSXML2.DOMDocument60
xmlDoc.async = False
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.SetRequestHeader "Accept", "text/xml"
.SetRequestHeader "API_KEY", APIkey
.Send
xmlDoc.loadXML .ResponseText
End With
Set xmlNodeList = xmlDoc.getElementsByTagName("*")
For Each xmlNode In xmlNodeList
For Each myNode In xmlNode.childNodes
If myNode.nodeType = NODE_TEXT Then
Debug.Print xmlNode.nodeName & "=" & xmlNode.text
End If
Next myNode
Next xmlNode
Set xmlDoc = Nothing
End Sub
This code prints every element of the XML, but when I try to find one specific node, the result is nothing.
In the line
Set xmlNodeList = xmlDoc.getElementsByTagName("*")
I tried to use the path "//AttributeDescription", but apparently it only works with "*".
I need to receive, for example, the response below:
AttributeDescription=Production
CountryName=Brazil
Value=0.00000
I did my best to get the right response, and I also wonder whether the XML structure is malformed, given that addressing a specific node returns nothing.
Is there anything that I can do to solve this issue?
There are two separate issues here.
MSXML2 has issues using XPath when the XML document has a default namespace. At the start of the document downloaded from the USDA site, there are some namespace declarations:
<ArrayOfCommodityData xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/PSDOnline.DataServices.Models">
There are two namespaces declared here. One with the prefix i and a default namespace that covers any element which does not have a namespace prefix. If you look at a "CalendarYear" entry in the XML document - <CalendarYear i:nil="true" /> - then you can see that "CalendarYear" is in the default namespace whereas "nil" is in the "i" namespace.
To make MSXML2 work with default namespaces, you have to declare a namespace which has the same URI as the default namespace. This is done using the SelectionNamespaces property of the XML document, like this:
xmlDoc.SetProperty "SelectionNamespaces", "xmlns:r='http://schemas.datacontract.org/2004/07/PSDOnline.DataServices.Models'"
I chose r as the namespace prefix, but the prefix you choose is irrelevant. It just has to be different from any other namespace prefix in the document.
This leads on to the second problem. You are using getElementsByTagName which just takes a tag name as a parameter but you are passing in an XPath string. To deal with an XPath string, you need to use SelectNodes instead and you need to use the namespace we added, like this:
Set xmlNodeList = xmlDoc.SelectNodes("//r:AttributeDescription")
It's a namespace issue, I think. People more familiar with this can likely show how to add the namespace properly and then reference it. I did try adding the two namespaces with the usual syntax, .setProperty "SelectionNamespaces", namespace, but still failed to set the objects, so I guess I did something wrong.
An interim, less robust solution that sidesteps the namespace by matching on name() is as follows:
Option Explicit

Public Sub test()
    Dim xmlDoc As MSXML2.DOMDocument60
    Dim URL As String, APIkey As String
    APIkey = "key"
    URL = "https://apps.fas.usda.gov/PSDOnlineDataServices/api/CommodityData/GetCommodityDataByYear?commodityCode=2222000&marketYear=2018"
    Set xmlDoc = New MSXML2.DOMDocument60
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .SetRequestHeader "Accept", "text/xml"
        .SetRequestHeader "API_KEY", APIkey
        .Send
        xmlDoc.LoadXML .responseText
    End With
    Dim node As IXMLDOMElement, r As Long
    ' name() compares the qualified name, so no namespace prefix is needed
    For Each node In xmlDoc.SelectNodes("/*[name()='ArrayOfCommodityData']/*[name()='CommodityData']")
        r = r + 1
        With ActiveSheet
            .Cells(r, 1) = node.ChildNodes(0).Text
            .Cells(r, 2) = node.ChildNodes(6).Text
            .Cells(r, 3) = node.ChildNodes(11).Text
        End With
    Next
End Sub
I reached this solution by mixing the two answers, and I'm sharing the code to help others.
First I set the property, and then I used the iteration to retrieve the values I needed. I don't know if this is the best solution: since I can't control the XML structure, I'll need to come back to this code if they change their file.
I tried to add a "safety line" to avoid any mistake in the output, though it is no problem for me to double-check since I have access to the data itself:
If node.childNodes(0).text = "Production" And node.childNodes(6).text = "Argentina" Then
This ensures that the attribute name and country are the ones I want.
Public Sub test_3()
    Dim xmlDoc As MSXML2.DOMDocument60
    Dim URL As String, APIkey As String
    APIkey = "8DB688F8-1E22-4031-B581-59C221ECDDA6"
    URL = "https://apps.fas.usda.gov/PSDOnlineDataServices/api/CommodityData/GetCommodityDataByYear?commodityCode=2222000&marketYear=2018"
    Set xmlDoc = New MSXML2.DOMDocument60
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .SetRequestHeader "Accept", "text/xml"
        .SetRequestHeader "API_KEY", APIkey
        .Send
        xmlDoc.loadXML .ResponseText
        xmlDoc.SetProperty "SelectionNamespaces", "xmlns:r='http://schemas.datacontract.org/2004/07/PSDOnline.DataServices.Models'"
    End With
    Dim node As IXMLDOMElement, r As Long
    For Each node In xmlDoc.selectNodes("//r:CommodityData")
        If node.childNodes(0).text = "Production" And node.childNodes(6).text = "Argentina" Then
            r = r + 1
            Debug.Print node.childNodes(0).text
            Debug.Print node.childNodes(6).text
            Debug.Print node.LastChild.text
            'With ActiveSheet
            '.Cells(r, 1) = node.childNodes(0).text
            '.Cells(r, 2) = node.childNodes(6).text
            '.Cells(r, 3) = node.LastChild.text
            'End With
        End If
    Next
End Sub
This solution prints the following in the Immediate window:
Production
Argentina
55300.0000
Exactly what I wanted.
Thanks again for the time and for sharing knowledge.
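On the concern about relying on child positions (childNodes(0), childNodes(6)): one possible alternative is to address the children by element name through the same r prefix, so the code keeps working if elements are reordered. This is only a sketch under assumptions: the function name GetCommodityValue is hypothetical, the inline XML is a heavily simplified stand-in for the USDA response (only the three element names quoted in the question), and MSXML 6.0 is late bound.

```vba
Option Explicit

' Returns the text of <Value> for the CommodityData entry matching the
' given attribute description and country, or "" when there is no match.
' (GetCommodityValue is a hypothetical helper, not part of any library.)
Public Function GetCommodityValue(ByVal xmlText As String, _
                                  ByVal attributeName As String, _
                                  ByVal countryName As String) As String
    Const NS As String = "http://schemas.datacontract.org/2004/07/PSDOnline.DataServices.Models"
    Dim xmlDoc As Object, node As Object
    Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")
    xmlDoc.async = False
    If Not xmlDoc.LoadXML(xmlText) Then Exit Function
    ' Map the document's default namespace to the prefix r for XPath
    xmlDoc.setProperty "SelectionNamespaces", "xmlns:r='" & NS & "'"
    Set node = xmlDoc.SelectSingleNode( _
        "//r:CommodityData[r:AttributeDescription='" & attributeName & _
        "' and r:CountryName='" & countryName & "']/r:Value")
    If Not node Is Nothing Then GetCommodityValue = node.Text
End Function

Public Sub DemoGetCommodityValue()
    ' Simplified sample; the real response has more child elements per entry
    Dim sample As String
    sample = "<ArrayOfCommodityData xmlns=""http://schemas.datacontract.org/2004/07/PSDOnline.DataServices.Models"">" & _
             "<CommodityData><AttributeDescription>Production</AttributeDescription>" & _
             "<CountryName>Argentina</CountryName><Value>55300.0000</Value></CommodityData>" & _
             "<CommodityData><AttributeDescription>Production</AttributeDescription>" & _
             "<CountryName>Brazil</CountryName><Value>0.00000</Value></CommodityData>" & _
             "</ArrayOfCommodityData>"
    Debug.Print GetCommodityValue(sample, "Production", "Argentina")
End Sub
```

With the real response, xmlText would be the ResponseText from the request shown earlier. The filtering moves into the XPath predicate, so no index positions are involved.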

how to get inner text of html under id?

I am trying to pull the inner text of an element, selected by its id, into an Excel cell.
This is the code, which uses an XMLHTTP request:
Sub getelementbyid()
Dim XMLpage As New MSXML2.XMLHTTP60
Dim hdoc As New MSHTML.HTMLDocument
Dim HBEs As MSHTML.IHTMLElementCollection
Dim HBE As MSHTML.IHTMLElement
Dim ha As String
XMLpage.Open "GET","https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HAL", False
XMLpage.send
hdoc.body.innerHTML = XMLpage.responseText
ha = hdoc.getelementbyid("open").innerText
Range("K11").Value = ha
Debug.Print ha
End Sub
I expect a numeric value as output, but it shows --.
Examine the response text. There is a difference in the way the page is rendered in the browser versus what is returned in the ResponseText.
I put the URL into a browser went into dev tools (F12), found the element, and noted the numeric value inside the HTML element.
Then I dumped the response text we're getting in VBA into a cell and copied the entire cell value into Notepad++. If you do that you'll see the initial value inside the #open element is indeed "--".
The real value appears to be getting written into the HTML via JavaScript, which is common practice. There is a JSON object at the top of the page, presumably injected into the document from the back-end of the website upon your request.
So you have to parse the JSON, not the HTML. I've provided code doing just that. Now, there may be a better way to do it, I feel this code is kind of "hacky" but it's getting the job done for your example URL.
Sub getelementbyid()
    Dim XMLpage As New MSXML2.XMLHTTP60
    Dim hdoc As New MSHTML.HTMLDocument
    Dim HBEs As MSHTML.IHTMLElementCollection
    Dim HBE As MSHTML.IHTMLElement
    Dim ha As String
    XMLpage.Open "GET", "https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HAL", False
    XMLpage.send
    '// sample: ,"open":"681.05",
    Dim token As String
    token = """open"":"""
    ' Long, not Integer: positions past 32,767 would overflow an Integer
    Dim startPosition As Long
    startPosition = InStr(1, XMLpage.responseText, token)
    Dim endPosition As Long
    endPosition = InStr(startPosition, XMLpage.responseText, ",")
    Dim prop As String
    prop = Mid(XMLpage.responseText, startPosition, endPosition - startPosition)
    ' Strip the quotes and the key so only the number remains
    prop = Replace(prop, """", vbNullString)
    prop = Replace(prop, "open:", vbNullString)
    Dim val As Double
    val = CDbl(prop)
    ha = val
    Range("K11").Value = ha
    Debug.Print ha
End Sub
Here are two methods: 1) use regex on the returned text, which is usually frowned upon but perfectly serviceable here; 2) the traditional route of extracting the JSON string and using a JSON parser to pull out the value.
The data you want is stored in a JSON string found both on the webpage and in the XMLHTTP response, under the same element (#responseDiv in the code below).
This means you can treat the HTML as a string and target just the pattern for the open price using regex as shown below, or load the XMLHTTP response into an HTML parser, grab the required element, extract its innerText and trim off the whitespace, then pass it to a JSON parser to extract the open price.
In both methods you want to avoid being served cached results, so the following header is an important addition to try to mitigate this:
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
There is no need for additional cell formatting. The full value comes out for both your tickers.
Regex:
The value is present in a JSON string in the response, so you can regex it out of the returned text easily.
Regex explanation: "open":" matches that literal text, and (.*?) lazily captures everything up to the next double quote, i.e. the open price.
VBA:
Option Explicit

Public Sub GetClosePrice()
    Dim ws As Worksheet, re As Object, p As String, r As String
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    p = """open"":""(.*?)"""
    Set re = CreateObject("VBScript.RegExp")
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HAL", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        If .Status = 200 Then
            r = GetValue(re, .responseText, p)
        Else
            r = "Failed connection"
        End If
    End With
    ws.Range("K11").Value = r
End Sub

Public Function GetValue(ByVal re As Object, ByVal inputString As String, ByVal pattern As String) As String
    With re
        .Global = True
        .pattern = pattern
        If .test(inputString) Then ' returns True if the regex pattern can be matched against the provided string
            GetValue = .Execute(inputString)(0).submatches(0)
        Else
            GetValue = "Not found"
        End If
    End With
End Function
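To try the pattern without hitting the website, it can be run against a literal string. The sketch below is illustrative only: ExtractOpen is a made-up name, and the sample text is invented apart from the "open":"681.05" fragment quoted in the earlier answer.

```vba
Option Explicit

' Returns the value captured after "open":" or "" when the key is absent.
' (ExtractOpen is a hypothetical helper name.)
Public Function ExtractOpen(ByVal inputText As String) As String
    Dim re As Object, matches As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = """open"":""(.*?)"""   ' i.e. "open":"(.*?)"
    Set matches = re.Execute(inputText)
    If matches.Count > 0 Then ExtractOpen = matches(0).SubMatches(0)
End Function

Public Sub DemoExtractOpen()
    ' Invented sample in the same shape as the JSON embedded in the page
    Debug.Print ExtractOpen("{""high"":""690.00"",""open"":""681.05"",""low"":""675.10""}")  ' prints 681.05
End Sub
```

Because the capture is lazy, it stops at the first closing quote after the value rather than running on to a later one.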
HTML and json parser:
This requires importing the JSON parser code from jsonconverter.bas into a standard module called JsonConverter, and then going to VBE > Tools > References and adding references to Microsoft Scripting Runtime and Microsoft HTML Object Library.
VBA:
Option Explicit

Public Sub GetClosePrice()
    Dim ws As Worksheet, re As Object, r As String, json As Object
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=MRF", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        If .Status = 200 Then
            Dim html As HTMLDocument
            Set html = New HTMLDocument
            html.body.innerHTML = .responseText
            ' The JSON string is the text content of the #responseDiv element
            Set json = JsonConverter.ParseJson(Trim$(html.querySelector("#responseDiv").innerText))
            r = json("data")(1)("open")
        Else
            r = "Failed connection"
        End If
    End With
    ws.Range("K11").Value = r
End Sub

How to get specific data from a tag from XML page?

I have the following code to get the response text from an XML page, but it returns everything in the page:
Private Sub Testing()
Dim xmlhttp As New MSXML2.XMLHTTP60, myurl As String
myurl = "http://schemas.xmlsoap.org/soap/envelope/"
xmlhttp.Open "GET", myurl, False
xmlhttp.send
newstring = xmlhttp.responseText
Sheet1.Range("B2") = newstring
End Sub
The URL in myurl is for example only. The URL I'm trying to get is on an intranet site, but it has a similar structure.
The following FIELD tag is not available in the URL above. I used the url as an example to show the type of document.
Let's say there's a tag like below in the middle of the page:
<FIELD NAME="str2">097cf4a8-2755-4c62-939c-9402e0a4e3e2</FIELD>
How do I get only this "097cf4a8-2755-4c62-939c-9402e0a4e3e2"? The str2 is unique.
Here is an example using a publicly available document. The principle of selecting by combination of tag and attribute value is shown.
Option Explicit

Public Sub Testing()
    Dim xmlhttp As New MSXML2.XMLHTTP60, myurl As String
    myurl = "http://www.chilkatsoft.com/xml-samples/bookstore.xml"
    xmlhttp.Open "GET", myurl, False
    xmlhttp.send
    Dim xmlDoc As MSXML2.DOMDocument60
    Set xmlDoc = New MSXML2.DOMDocument60
    xmlDoc.LoadXML xmlhttp.responseText
    ' Tag name plus attribute test: @rating selects on the rating attribute
    Debug.Print xmlDoc.SelectSingleNode("//userComment[@rating=""3""]").Text
End Sub
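Applied to the FIELD tag from the question, the same tag-plus-attribute selection might look like the sketch below. It rests on assumptions: GetFieldValue is a hypothetical helper, the ROOT wrapper and the str1 entry are invented so the sample is a complete document, and the real intranet XML is assumed to have no default namespace.

```vba
Option Explicit

' Returns the text of the FIELD element whose NAME attribute equals
' fieldName, or "" when the XML does not parse or no such element exists.
' (GetFieldValue is a hypothetical helper name.)
Public Function GetFieldValue(ByVal xmlText As String, ByVal fieldName As String) As String
    Dim xmlDoc As Object, node As Object
    Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")
    xmlDoc.async = False
    If Not xmlDoc.LoadXML(xmlText) Then Exit Function
    Set node = xmlDoc.SelectSingleNode("//FIELD[@NAME='" & fieldName & "']")
    If Not node Is Nothing Then GetFieldValue = node.Text
End Function

Public Sub DemoGetFieldValue()
    ' ROOT and the str1 field are invented; only the str2 field comes from the question
    Dim sample As String
    sample = "<ROOT><FIELD NAME=""str1"">other</FIELD>" & _
             "<FIELD NAME=""str2"">097cf4a8-2755-4c62-939c-9402e0a4e3e2</FIELD></ROOT>"
    Debug.Print GetFieldValue(sample, "str2")
End Sub
```

In the question's code, xmlText would be xmlhttp.responseText. Note that XPath is case-sensitive, so FIELD and NAME must match the document's casing.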