How to parse XML in VBA and retrieve specific values

I've spent two weeks searching, without success, for a way to parse one specific XML document and fetch just a few values. I tried every piece of code I could find online until I found one that solved part of my problem.
The XML I'm trying to fetch comes from the U.S. Department of Agriculture and is free to access.
https://apps.fas.usda.gov/psdonline/app/index.html#/app/about
Public Sub GetCommodityData()
    ' Requires a reference to Microsoft XML, v6.0 (Tools > References)
    Dim xmlDoc As MSXML2.DOMDocument60
    Dim xmlNode As MSXML2.IXMLDOMNode
    Dim xmlNodeList As MSXML2.IXMLDOMNodeList
    Dim myNode As MSXML2.IXMLDOMNode
    Dim URL As String, APIkey As String
    APIkey = "8DB688F8-1E22-4031-B581-59C221ECDDA6"
    URL = "https://apps.fas.usda.gov/PSDOnlineDataServices/api/CommodityData/GetCommodityDataByYear?commodityCode=2222000&marketYear=2018"
    Set xmlDoc = New MSXML2.DOMDocument60
    xmlDoc.async = False
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .SetRequestHeader "Accept", "text/xml"
        .SetRequestHeader "API_KEY", APIkey
        .Send
        xmlDoc.loadXML .ResponseText
    End With
    ' Walk every element and print those that directly contain text
    Set xmlNodeList = xmlDoc.getElementsByTagName("*")
    For Each xmlNode In xmlNodeList
        For Each myNode In xmlNode.childNodes
            If myNode.nodeType = NODE_TEXT Then
                Debug.Print xmlNode.nodeName & "=" & xmlNode.text
            End If
        Next myNode
    Next xmlNode
    Set xmlDoc = Nothing
End Sub
This code prints every element of the XML, but when I try to find one specific node, I get nothing.
In the line
Set xmlNodeList = xmlDoc.getElementsByTagName("*")
I tried passing "//AttributeDescription" instead, but apparently it only works with "*".
I need to get, for example, the output below:
AttributeDescription=Production
CountryName=Brazil
Value=0.00000
I did my best to get the right response. I also suspect that the XML structure is not in the right format, given that nothing comes back when I address a node by name...
Is there anything I can do to solve this issue?

There are two separate issues here.
MSXML2 has trouble with XPath when the XML document has a default namespace. At the start of the document downloaded from the USDA site, there are some namespace declarations:
<ArrayOfCommodityData xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/PSDOnline.DataServices.Models">
Two namespaces are declared here: one with the prefix i, and a default namespace that covers any element without a namespace prefix. If you look at a "CalendarYear" entry in the XML document - <CalendarYear i:nil="true" /> - you can see that "CalendarYear" is in the default namespace whereas "nil" is in the "i" namespace.
To make MSXML2 work with default namespaces, you have to declare a namespace which has the same URI as the default namespace. This is done using the SelectionNamespaces property of the XML document, like this:
xmlDoc.SetProperty "SelectionNamespaces", "xmlns:r='http://schemas.datacontract.org/2004/07/PSDOnline.DataServices.Models'"
I chose r as the namespace prefix, but the name you choose is irrelevant; it just has to be different from any other namespace prefix in the document.
This leads on to the second problem. You are using getElementsByTagName which just takes a tag name as a parameter but you are passing in an XPath string. To deal with an XPath string, you need to use SelectNodes instead and you need to use the namespace we added, like this:
Set xmlNodeList = xmlDoc.SelectNodes("//r:AttributeDescription")
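The effect of SelectionNamespaces can be checked without calling the USDA service. The following is a minimal sketch, not the answerer's code: CountNodes and NamespaceXPathDemo are hypothetical names, the XML is a hand-written fragment shaped like the response described in the question, and late binding is used so no reference is needed.

```vba
Option Explicit

' Hypothetical helper: counts the nodes an XPath selects, optionally after
' binding a prefix to a namespace URI through the SelectionNamespaces property.
Public Function CountNodes(ByVal xml As String, ByVal xpath As String, _
                           Optional ByVal selectionNamespaces As String = "") As Long
    Dim doc As Object
    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False
    If Not doc.LoadXML(xml) Then Err.Raise vbObjectError + 1, , doc.parseError.reason
    If Len(selectionNamespaces) > 0 Then
        doc.SetProperty "SelectionNamespaces", selectionNamespaces
    End If
    CountNodes = doc.SelectNodes(xpath).Length
End Function

Public Sub NamespaceXPathDemo()
    Const NS As String = "http://schemas.datacontract.org/2004/07/PSDOnline.DataServices.Models"
    Dim xml As String
    ' Hand-written sample, assumed to mirror the shape of the real response
    xml = "<ArrayOfCommodityData xmlns=""" & NS & """>" & _
          "<CommodityData><AttributeDescription>Production</AttributeDescription>" & _
          "<CountryName>Brazil</CountryName><Value>0.00000</Value></CommodityData>" & _
          "</ArrayOfCommodityData>"

    ' An unprefixed name only matches elements in no namespace: prints 0
    Debug.Print CountNodes(xml, "//AttributeDescription")
    ' With the prefix r bound to the default namespace URI: prints 1
    Debug.Print CountNodes(xml, "//r:AttributeDescription", "xmlns:r='" & NS & "'")
End Sub
```

Run NamespaceXPathDemo from the Immediate window to see the two counts side by side.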

I think it's a namespace issue. People more familiar with this can likely show how to add the namespaces properly and then reference them. I did try adding the two namespaces with the usual syntax, .setProperty "SelectionNamespaces", namespace, but the node objects still failed to set, so I guess I did something wrong.
An interim, less robust solution is as follows:
Option Explicit
Public Sub test()
Dim xmlDoc As MSXML2.DOMDocument60
Dim URL As String, APIkey As String
APIkey = "key"
URL = "https://apps.fas.usda.gov/PSDOnlineDataServices/api/CommodityData/GetCommodityDataByYear?commodityCode=2222000&marketYear=2018"
Set xmlDoc = New MSXML2.DOMDocument60
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.SetRequestHeader "Accept", "text/xml"
.SetRequestHeader "API_KEY", APIkey
.Send
xmlDoc.LoadXML .responseText
End With
Dim node As IXMLDOMElement, r As Long
For Each node In xmlDoc.SelectNodes("/*[name()='ArrayOfCommodityData']/*[name()='CommodityData']")
r = r + 1
With ActiveSheet
.Cells(r, 1) = node.ChildNodes(0).Text
.Cells(r, 2) = node.ChildNodes(6).Text
.Cells(r, 3) = node.ChildNodes(11).Text
End With
Next
End Sub

I reached this solution by mixing the two answers, and I'm sharing the code to help others.
First I set the property, then I iterated over the nodes to retrieve the values I needed. I don't know if this is the best solution: I can't control the XML structure, so if they change their file I'll need to come back to this code.
I added a "safety line" to avoid mistakes in the output, although it's no problem for me to double-check, since I have access to the data itself.
If node.childNodes(0).text = "Production" And node.childNodes(6).text = "Argentina" Then
It ensures that the attribute name and the country are the ones I want.
Public Sub test_3()
Dim xmlDoc As MSXML2.DOMDocument60
Dim URL As String, APIkey As String
APIkey = "8DB688F8-1E22-4031-B581-59C221ECDDA6"
URL = "https://apps.fas.usda.gov/PSDOnlineDataServices/api/CommodityData/GetCommodityDataByYear?commodityCode=2222000&marketYear=2018"
Set xmlDoc = New MSXML2.DOMDocument60
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.SetRequestHeader "Accept", "text/xml"
.SetRequestHeader "API_KEY", APIkey
.Send
xmlDoc.loadXML .ResponseText
xmlDoc.SetProperty "SelectionNamespaces", "xmlns:r='http://schemas.datacontract.org/2004/07/PSDOnline.DataServices.Models'"
End With
Dim node As IXMLDOMElement, r As Long
For Each node In xmlDoc.selectNodes("//r:CommodityData")
If node.childNodes(0).text = "Production" And node.childNodes(6).text = "Argentina" Then
r = r + 1
Debug.Print node.childNodes(0).text
Debug.Print node.childNodes(6).text
Debug.Print node.LastChild.text
'With ActiveSheet
'.Cells(r, 1) = node.childNodes(0).text
'.Cells(r, 2) = node.childNodes(6).text
'.Cells(r, 3) = node.LastChild.text
'End With
End If
Next
End Sub
This solution prints the following in the Immediate window:
Production
Argentina
55300.0000
Exactly what I wanted.
Thanks again for the time and for sharing knowledge.
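On the concern about the file structure changing: one way to depend less on child positions such as childNodes(0) and childNodes(6) is to select by element name inside the XPath itself. This is only a sketch under assumptions: FindValue and FindValueDemo are hypothetical names, the sample XML is hand-written, and the element names AttributeDescription, CountryName and Value are taken from the output shown in the question.

```vba
Option Explicit

' Hypothetical helper: returns the Value text for one attribute/country pair,
' or vbNullString when no CommodityData element matches.
' Assumes the prefix r is already bound through SelectionNamespaces.
Public Function FindValue(ByVal doc As Object, ByVal attributeName As String, _
                          ByVal countryName As String) As String
    Dim node As Object
    Set node = doc.SelectSingleNode( _
        "//r:CommodityData[r:AttributeDescription='" & attributeName & _
        "' and r:CountryName='" & countryName & "']/r:Value")
    If Not node Is Nothing Then FindValue = node.Text
End Function

Public Sub FindValueDemo()
    Const NS As String = "http://schemas.datacontract.org/2004/07/PSDOnline.DataServices.Models"
    Dim doc As Object
    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False
    ' Hand-written sample standing in for the downloaded response
    doc.LoadXML "<ArrayOfCommodityData xmlns=""" & NS & """>" & _
        "<CommodityData><AttributeDescription>Production</AttributeDescription>" & _
        "<CountryName>Argentina</CountryName><Value>55300.0000</Value></CommodityData>" & _
        "</ArrayOfCommodityData>"
    doc.SetProperty "SelectionNamespaces", "xmlns:r='" & NS & "'"
    Debug.Print FindValue(doc, "Production", "Argentina")   ' 55300.0000
End Sub
```

The predicate does the filtering that the If line does in the loop, so a reordering of the child elements would not break the lookup. A rename of the elements still would.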

Related

How can we use http.Open "GET" to list items from a table in HTML?

I'm testing an idea that I had. It seems like I should be able to scrape out various HTML elements from a table in a website, but my code can't seem to find the table, which definitely seems to be there.
Sub TryThis()
Dim oHtml As HTMLDocument
Dim oElement As Object
Set oHtml = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population", False
.send
oHtml.body.innerHTML = .responseText
End With
Set myitem = oHtml.getElementsByClassName("wikitable sortable jquery-tablesorter")
i = 0
For Each oElement In myitem
Sheets("Sheet1").Range("A" & i + 1) = myitem(i).innerText
i = i + 1
Next oElement
End Sub
Essentially, I would like to loop through the HTML items and print out, in cells, what is in the table with the class 'wikitable sortable jquery-tablesorter'.
You were really close. I think the issue is that the jquery-tablesorter class is added by jQuery (or a plugin) via JavaScript after the page has loaded. That class isn't present in the HTML returned by the web request, so removing it from the search criteria should fix the issue.
Here's what I came up with to address this, and also to move the table contents over a bit quicker. I only handled the first table with the wikitable sortable classes, but it should be possible to loop over each table too.
Sub TryThis()
Dim oHtml As HTMLDocument
Dim oElement As Object
Dim htmlText As String
Set oHtml = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population", False
.send
oHtml.body.innerHTML = .responseText
End With
htmlText = oHtml.getElementsByClassName("wikitable sortable")(0).outerhtml
With CreateObject("new:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}") 'Clipboard
.SetText htmlText
.PutInClipboard
Sheets(1).Range("A1").Select
Sheets(1).PasteSpecial Format:="Unicode Text"
End With
End Sub
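As an alternative to the clipboard, the table can also be walked row by row. The sketch below is an assumption-laden illustration rather than part of the answer: WriteTable and WriteTableDemo are hypothetical names, the markup is a small stand-in for the Wikipedia response, and merged cells (rowspan/colspan) are not handled. It needs the same Microsoft HTML Object Library reference as the code above.

```vba
Option Explicit
' Requires a reference to Microsoft HTML Object Library (Tools > References).

' Hypothetical helper: copies the cell text of an HTML table into a worksheet,
' starting at topLeft, one worksheet row per <tr>.
Public Sub WriteTable(ByVal tbl As MSHTML.HTMLTable, ByVal topLeft As Range)
    Dim tr As Object, td As Object
    Dim r As Long, c As Long
    For Each tr In tbl.Rows
        c = 0
        For Each td In tr.Cells          ' covers both <th> and <td> cells
            topLeft.Offset(r, c).Value = td.innerText
            c = c + 1
        Next td
        r = r + 1
    Next tr
End Sub

Public Sub WriteTableDemo()
    Dim doc As New MSHTML.HTMLDocument
    ' Stand-in markup; in practice this would be the .responseText of the request
    doc.body.innerHTML = "<table class='wikitable sortable'>" & _
        "<tr><th>Country</th><th>Population</th></tr>" & _
        "<tr><td>A</td><td>1</td></tr></table>"
    WriteTable doc.getElementsByClassName("wikitable sortable")(0), ActiveSheet.Range("A1")
End Sub
```

To cover every matching table, the same helper could be called in a loop over the collection returned by getElementsByClassName, offsetting topLeft for each table.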

How to pull image from Amazon UK

I am trying to download the hi-res pictures from Amazon.co.uk. I tried the code given in a related thread and I am running into some issues.
The code by @QHarr works well for the Amazon.in website, but when I try it on Amazon.co.uk, .querySelector("#landingImage").getAttribute("data-old-hires") returns nothing. Here is the code I am testing.
Public Sub GetInfo()
Dim Html As HTMLDocument, results()
Set Html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.amazon.co.uk/dp/product/B01GFJWHZ0", True
.send
Html.body.innerHTML = .responseText
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 2) = Html.querySelector("#productTitle").innerText
.Cells(1, 2).Offset(0, 1) = Html.querySelector("#landingImage").getAttribute("data-old-hires")
End With
End With
End Sub
Any idea what exactly I am doing wrong?
My guess is the response html is mangled. You can easily regex out that bit from the response string however.
Note that the async argument is also set to False to allow time for data to load. I'm somewhat surprised your code got as far as you said it did. It should also have tripped over on the fact there is no match for that selector in the HTMLDocument.
Public Sub GetInfo()
'tools > references > Microsoft HTML Object Library
Dim re As Object, html As MSHTML.HTMLDocument, xhr As Object, s As String
Set re = CreateObject("VBScript.RegExp")
Set xhr = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
re.Pattern = """hiRes"":""(.*?)"""
With xhr
.Open "GET", "https://www.amazon.co.uk/dp/product/B01GFJWHZ0", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
html.body.innerHTML = s
End With
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 2) = html.querySelector("#productTitle").innerText
.Cells(1, 3) = re.Execute(s)(0).SubMatches(0)
End With
End Sub
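Regex: the pattern "hiRes":"(.*?)" matches the literal "hiRes":" and lazily captures everything up to the next double quote, which is the hi-res image URL.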
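The lazy capture in the pattern above can be tried on a small string before pointing it at a live page. This is a sketch only: FirstSubMatch and HiResDemo are hypothetical names, the sample string is made up to resemble the JSON embedded in the page, and the helper assumes the pattern contains at least one capture group.

```vba
Option Explicit

' Hypothetical helper: returns the first capture group of the first match,
' or vbNullString when the pattern does not match.
' Assumes the pattern has at least one capture group.
Public Function FirstSubMatch(ByVal source As String, ByVal pattern As String) As String
    Dim re As Object, matches As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = pattern
    Set matches = re.Execute(source)
    If matches.Count > 0 Then FirstSubMatch = matches(0).SubMatches(0)
End Function

Public Sub HiResDemo()
    Dim sample As String
    ' Made-up fragment shaped like the JSON found in the response
    sample = "{""hiRes"":""https://example.com/a.jpg"",""thumb"":""https://example.com/t.jpg""}"
    Debug.Print FirstSubMatch(sample, """hiRes"":""(.*?)""")   ' https://example.com/a.jpg
End Sub
```

Checking matches.Count first avoids the runtime error that re.Execute(s)(0) raises when the page no longer contains the expected text.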

How can I parse values from a dynamic webpage using Excel VBA when XML/XPath doesn't seem to work?

I am attempting to scrape values from a collection of webpages using an XPath to parse what I want out of the XML. I grab the full XPath of the element using Chrome, but when I use it in the code it doesn't seem to select the node I am looking for. When I execute the XPath statement in the console, it also does not return the node. Some other element XPaths work in the console but not in VBA. Am I missing something? My simple test XML works OK. My attempts to use a namespace in the XPath were also not successful. Code below, with an example of one of the webpages and one of the elements of interest:
Sub test()
testXML = "<test example='hello'>hello</test>"
Dim oXMLHTTP As Object
Dim sPageHTML As String
Dim sURL As String
Dim XmlResponse As String
Dim strXML As String
Dim xNode As MSXML2.IXMLDOMNode
Dim xmlElement As MSXML2.IXMLDOMElement
Dim XDoc As MSXML2.DOMDocument60
sURL = "https://www.bestplaces.net/crime/zip-code/alaska/anchorage/99510"
Set oXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
oXMLHTTP.SetOption(2) = 13056 'Disable CA messages
oXMLHTTP.Open "GET", sURL, False
oXMLHTTP.send
XmlResponse = oXMLHTTP.responseText
'strXML = testXML
strXML = XmlResponse
Set XDoc = New MSXML2.DOMDocument60
'XDoc.setProperty "SelectionNamespaces", "xmlns:a='http://www.w3.org/1999/xhtml'"
'XDoc.setProperty "SelectionNamespaces", "xmlns:a='http://www.w3.org/2000/svg'"
'XDoc.setProperty "SelectionNamespaces", "xmlns:a='http://www.w3.org/1999/xlink'"
XDoc.LoadXML (strXML)
'Set xNode = XDoc.SelectSingleNode("/test")
Set xNode = XDoc.SelectSingleNode("/html/body/form/div[7]/div[2]/div[2]/div[3]/div/div/div/div/div/svg/g[6]/g[1]/text/tspan[2]")
If xNode Is Nothing Then
MsgBox "Nothing"
Else: MsgBox xNode.text
End If
End Sub
You are getting HTML back, not XML. A quick look at the page source shows that the value is populated dynamically but should be available by running a regex over responseText, so your XPath wouldn't work even if converted to an equivalent path for an HTML parser.
Option Explicit
Public Sub GetValue()
Dim http As Object, s As String, re As Object
Set http = CreateObject("MSXML2.XMLHTTP")
Set re = CreateObject("VBScript.RegExp")
With http
.Open "GET", "https://www.bestplaces.net/crime/zip-code/alaska/anchorage/99510", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
End With
With re
.Pattern = "data:\s?\[(.*?),"
Debug.Print .Execute(s)(0).SubMatches(0)
End With
End Sub
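Regex explanation: the pattern data:\s?\[(.*?), matches data: followed by an optional whitespace character and an opening bracket, then lazily captures everything up to the first comma, i.e. the first value in the array.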
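Why SelectSingleNode returned Nothing in the question can be confirmed by checking whether LoadXML succeeded at all. A minimal sketch, assuming late binding: XmlParseError and ParseErrorDemo are hypothetical names, and the HTML string is a made-up example of markup that is valid HTML but not well-formed XML.

```vba
Option Explicit

' Hypothetical helper: returns vbNullString when the text parses as XML,
' otherwise the parser's description of the first error.
Public Function XmlParseError(ByVal source As String) As String
    Dim doc As Object
    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False
    If Not doc.LoadXML(source) Then XmlParseError = doc.parseError.reason
End Function

Public Sub ParseErrorDemo()
    ' The test XML from the question is well-formed: prints 0
    Debug.Print Len(XmlParseError("<test example='hello'>hello</test>"))
    ' Typical HTML (unclosed <br>) is not well-formed XML: prints the parser's message
    Debug.Print XmlParseError("<html><body><p>text<br></body></html>")
End Sub
```

When LoadXML returns False the document stays empty, so any XPath run against it finds nothing, whatever the path is.") = vbNullString
Debug.Assert Len(XmlParseError("<html><body><p>text<br></body></html>")) > 0
</test>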

how to get inner text of html under id?

I am trying to pull the inner text of an element, selected by id, into an Excel cell.
This is the code, using XMLHTTP.
Sub getelementbyid()
Dim XMLpage As New MSXML2.XMLHTTP60
Dim hdoc As New MSHTML.HTMLDocument
Dim HBEs As MSHTML.IHTMLElementCollection
Dim HBE As MSHTML.IHTMLElement
Dim ha As String
XMLpage.Open "GET","https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HAL", False
XMLpage.send
hdoc.body.innerHTML = XMLpage.responseText
ha = hdoc.getelementbyid("open").innerText
Range("K11").Value = ha
Debug.Print ha
End Sub
I expect a numeric value as output, but it shows --.
Examine the response text. There is a difference in the way the page is rendered in the browser versus what is returned in the ResponseText.
I put the URL into a browser went into dev tools (F12), found the element, and noted the numeric value inside the HTML element.
Then I dumped the response text we're getting in VBA into a cell and copied the entire cell value into Notepad++. If you do that you'll see the initial value inside the #open element is indeed "--".
The real value appears to be getting written into the HTML via JavaScript, which is common practice. There is a JSON object at the top of the page, presumably injected into the document from the back-end of the website upon your request.
So you have to parse the JSON, not the HTML. I've provided code doing just that. Now, there may be a better way to do it, I feel this code is kind of "hacky" but it's getting the job done for your example URL.
Sub getelementbyid()
    Dim XMLpage As New MSXML2.XMLHTTP60
    Dim hdoc As New MSHTML.HTMLDocument
    Dim HBEs As MSHTML.IHTMLElementCollection
    Dim HBE As MSHTML.IHTMLElement
    Dim ha As String
    XMLpage.Open "GET", "https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HAL", False
    XMLpage.send
    '// sample: ,"open":"681.05",
    Dim token As String
    token = """open"":"""
    ' Long rather than Integer: a position in a large response can exceed 32,767
    Dim startPosition As Long
    startPosition = InStr(1, XMLpage.responseText, token)
    Dim endPosition As Long
    endPosition = InStr(startPosition, XMLpage.responseText, ",")
    ' Cut out "open":"681.05" and strip the quotes and the property name
    Dim prop As String
    prop = Mid(XMLpage.responseText, startPosition, endPosition - startPosition)
    prop = Replace(prop, """", vbNullString)
    prop = Replace(prop, "open:", vbNullString)
    Dim val As Double
    val = CDbl(prop)
    ha = val
    Range("K11").Value = ha
    Debug.Print ha
End Sub
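The InStr/Mid steps above can be wrapped in a small function that also copes with a missing token. This is a sketch, not the answerer's code: TextBetween and TextBetweenDemo are hypothetical names, and the sample string is the one from the comment in the code above.

```vba
Option Explicit

' Hypothetical helper: returns the text between the first occurrence of
' startToken and the next occurrence of endToken, or vbNullString when
' either token is missing.
Public Function TextBetween(ByVal source As String, ByVal startToken As String, _
                            ByVal endToken As String) As String
    Dim startPos As Long, endPos As Long
    startPos = InStr(1, source, startToken)
    If startPos = 0 Then Exit Function
    startPos = startPos + Len(startToken)        ' first character after the token
    endPos = InStr(startPos, source, endToken)
    If endPos = 0 Then Exit Function
    TextBetween = Mid$(source, startPos, endPos - startPos)
End Function

Public Sub TextBetweenDemo()
    ' Sample: ,"open":"681.05",
    Debug.Print TextBetween(",""open"":""681.05"",", """open"":""", """")   ' 681.05
End Sub
```

Returning a string and leaving the conversion to the caller keeps the helper independent of locale-specific number parsing in CDbl.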
Here are two methods. 1) Using regex on the return text. Usually frowned upon but perfectly serviceable here. 2) Traditional extract json string and use json parser to parse out value.
The data you want is stored in a JSON string found both on the webpage and in the XMLHTTP response, under the same element (the one with id responseDiv).
This means you can treat the HTML as a string and target just the pattern for the open price using regex as shown below, or parse the XMLHTTP response into an HTML parser, grab the required element, extract its innerText and trim off the whitespace, then pass it to a JSON parser to extract the open price.
In both methods you want to avoid being served cached results, so the following header is an important addition to attempt to mitigate this:
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
There is no need for additional cell formatting. The full value comes out for both your tickers.
Regex:
The value is present in a JSON string in the response, so you can regex it out of the returned text. The pattern "open":"(.*?)" matches the literal "open":" and lazily captures everything up to the next double quote.
VBA:
Option Explicit
Public Sub GetClosePrice()
Dim ws As Worksheet, re As Object, p As String, r As String
Set ws = ThisWorkbook.Worksheets("Sheet1")
p = """open"":""(.*?)"""
Set re = CreateObject("VBScript.RegExp")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=HAL", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
If .Status = 200 Then
r = GetValue(re, .responseText, p)
Else
r = "Failed connection"
End If
End With
ws.Range("K11").Value = r
End Sub
Public Function GetValue(ByVal re As Object, ByVal inputString As String, ByVal pattern As String) As String
With re
.Global = True
.pattern = pattern
If .test(inputString) Then ' returns True if the regex pattern can be matched against the provided string
GetValue = .Execute(inputString)(0).submatches(0)
Else
GetValue = "Not found"
End If
End With
End Function
HTML and json parser:
This requires installing code for jsonparser from jsonconverter.bas in a standard module called JsonConverter and then going VBE>Tools>References>Add a reference to Microsoft Scripting Runtime and Microsoft HTML Object Library.
VBA:
Option Explicit
Public Sub GetClosePrice()
Dim ws As Worksheet, re As Object, r As String, json As Object
Set ws = ThisWorkbook.Worksheets("Sheet1")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=MRF", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
If .Status = 200 Then
Dim html As HTMLDocument
Set html = New HTMLDocument
html.body.innerHTML = .responseText
Set json = JsonConverter.ParseJson(Trim$(html.querySelector("#responseDiv").innerText))
r = json("data")(1)("open")
Else
r = "Failed connection"
End If
End With
ws.Range("K11").Value = r
End Sub

Excel VBA source code for extracting data from a URL

I want to extract the title of every news item displayed on the "http://pib.nic.in/newsite/erelease.aspx?relid=58313" website using Excel VBA. I have written code using getelementsbyclassname("contentdiv"), but the debugger shows an error saying that the object doesn't support... I want to extract the information items for every relid, which is also in the URL...
Cold scrapes like this are generally handled more efficiently with an XMLHTTP pull. This requires the addition of a few libraries to the VBE's Tools ► References. The code below needs Microsoft XML, v6.0, Microsoft HTML Object Library and Microsoft Internet Controls. You might not need the last one, but you probably will if you expand the code beyond what is supplied.
Public Const csURL As String = "http://pib.nic.in/newsite/erelease.aspx?relid=×ID×"
Sub scrape_PIBNIC()
Dim htmlBDY As HTMLDocument, xmlHTTP As MSXML2.ServerXMLHTTP60
Dim i As Long, u As String, iDIV As Long
On Error GoTo CleanUp
Set xmlHTTP = New MSXML2.ServerXMLHTTP60
Set htmlBDY = New HTMLDocument
For i = 58313 To 58313
htmlBDY.body.innerHTML = vbNullString
With xmlHTTP
u = Replace(csURL, "×ID×", i)
'Debug.Print u
.Open "GET", u, False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
.send
If .Status <> 200 Then GoTo CleanUp
htmlBDY.body.innerHTML = .responseText
For iDIV = 0 To (htmlBDY.getElementsByClassName("contentdiv").Length - 1)
If CBool(htmlBDY.getElementsByClassName("contentdiv")(iDIV).getElementsByTagName("span").Length) Then
Sheets("Sheet1").Cells(Rows.Count, 1).End(xlUp).Offset(1, 0) = _
htmlBDY.getElementsByClassName("contentdiv")(iDIV).getElementsByTagName("span")(0).innerText
End If
Next iDIV
End With
Next i
CleanUp:
Set htmlBDY = Nothing
Set xmlHTTP = Nothing
End Sub
That should be enough to get you started. The site you are targeting requires that charset=UTF-8 be added to the request. I had no success without it. I strongly suspect that this may have been the source of your object doesn't support error.
