Cant retrieve element from page - excel

I'm having trouble trying to retrieve the IUPAC name of a chemical on the following page:
https://echa.europa.eu/brief-profile/-/briefprofile/100.000.685
I'd simply like the printed result to return as Benzene in this example.
The code below pulls all elements with className `
Public Sub GetContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.000.685", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
Set IUPACName = HTMLDoc.getElementsByClassName("col-sm-8")(0)
Debug.Print IUPACName.innerText
End Sub
This returns:
EC / List name: IUPAC name: benzene Substance names and other identifiers
Inspecting the page there doesn't seem to be any obvious identifier to just return Benzene. Wondering how people would go about this.
Here is an image of the Text I wish to pull.

I can't test on other Office versions but 2019, at least, you can use an attribute selector as follows:
Set IUPACName = HTMLDoc.querySelector("[title*=IUPAC]")
Debug.Print IUPACName.innerText
I was expecting to use:
Debug.Print IUPACName.NextSibling.NodeValue
So, that latter one maybe what you need on your Office version.
The world of mshtml.dll is quite topsy-turvy as moment.

Related

Scraping "href" using Chrome

long time listener first time poster,
I am hoping to get some help scraping the href attribute from a website using google chrome. I have searched and tried for hours and for the life of me cannot get the code to work.
This is the website: https://pool.pm/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4/%409e9e948d
This is a snippet of HTML code that I am trying to retrieve.
One of the things I noticed is that "topics" returns empty values and is not pulling what i need to. So this makes the rest of my code irrelevant. I am sure I am missing something fundamental, but I cannot find it. Any help would be greatly appreciated.
My code is currently as follows:
Option Explicit
Sub openurl()
Dim myurl As String
Dim request As Object
Dim response As String
Dim html As New HTMLDocument
Dim topics As Object
Dim titleElem As Object
myurl = "https://pool.pm/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4/%409e9e948d"
Set request = CreateObject("MSXML2.XMLHTTP")
request.Open "GET", myurl, False
request.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
request.send
response = StrConv(request.responseBody, vbUnicode)
html.body.innerHTML = response
Set topics = html.getElementsByClassName("hc ah cx s e wc ccx ccy lnk")
Sheets("main").Range("A3").Value = topics.getElementsByTagName("a").href
End Sub
The site is not generating HTML without JavaScript, so you have the wrong URL.
This currently has nothing to do with Chrome. You're making a simple HTTP Request.
This is the async resource that has the item you want:
https://pool.pm/wallet/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4
It returns a json Response, one of the fields in an array of tokens.
You have to parse the json Response with VBA.
Entry 207 has the item you want.
You could probably just loop the tokens and access .name and .policy to generate all the href links you want.
{
...
tokens[207].policy = "9e9e948d01bc64e29c26fbf85922d8d80dbf987222ffb45a6fe9f480",
tokens[207].name = "DungeonLootersClubWeapon0112"
}

pulling href after getting elements

Hopefully a really easy easy question but this is something I always seem to run into issues when web scraping.
I'm webscraping from a database containing many chemical dossiers, some of which have a separate section for toxicological information and some do not. In this example the url provided is fixed as I know this does contain a link to the toxicological info and so will pull the "Sub" url from.
I wish to check if the website has this info by pulling this ur, and if not running conditional code to give a message saying no tox info etc..
Inspecting the page:
<li id="SubNav7_1" class="active"> Toxicological Summary </li>
I have navigated correctly to SubNav7 but I run into a runtime error 13 when trying to get the url.
Public Sub GetContents()
'Start ECHA Search via XML HTTP Request
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/15460", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
'GetLink
Set link = HTMLDoc.getElementById("SubNav7_1").getAttribute("href")
Debug.Print link
End Sub
The expected output is https://echa.europa.eu/registration-dossier/-/registered-dossier/15460/7/1
If anyone could point how I can essentially get to the a tag attribute under SubNav7_1 that'd be great
When you print the whole website, you will notice that the href attribute you're looking for is not in the SubNav7_1 element. It's in a element inside it:
<li id="SubNav7_1">
Toxicological Summary
</li>
Therefore, you're getting an error accessing "href" attribute of the "li" element, because such an attribute does not exist.
If you're wondering, here's how I modified your code to see what's going on in the site you're scraping (and how I got the HTML shown above):
'Start ECHA Search via XML HTTP Request
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/15460", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
'GetLink
Dim link As String
Debug.Print Mid(HTMLDoc.body.outerHTML, InStr(1, HTMLDoc.body.outerHTML, "SubNav7_1"), 150)
link = HTMLDoc.getElementById("SubNav7_1").getAttribute("href")
Debug.Print link

Forcing documentMode when using MSHTML.HTMLDocument

Can HTMLDocument be forced to a specific documentMode when using MSHTML in Excel?
So far, all properties and methods related to this seem to only return values and cannot be set (ex. documentMode, compatMode, compatible).
While scraping and parsing HTML, I'm getting different behaviours in Excel on other machines in the organization which is why I want to standardize as much as I can.
Code:
Dim doc As HTMLDocument
Set doc = New HTMLDocument
Debug.Print "compatMode: " & doc.compatMode
Debug.Print "documentMode: " & doc.documentMode
My machine:
compatMode: BackCompat
documentMode: 11
Other machines:
compatMode: BackCompat
documentMode: 5
For the systems I compared with, the OS builds and MS Office (O365) versions were the same as my machine. I also compared the version of msxml3.dll and msxml6.dll which were also the same with my machine.
Instead of MSXML2.XMLHTTP.6.0, I used an instance of InternetExplorer
Instead of instantiating various classes from MSHTML, I simply used the generic Object class. Instantiating anything from MSHTML would introduce different documentModes.
Code example:
'Get document using IE
Dim doc As HTMLDocument
...
set doc = ie.Document
'Old - Extracting rows
Dim element As MSHTML.HTMLGenericElement
For Each element In tableRows
'New - Extracting rows
Dim element As Object
For Each element In tableRows

VBA: Extracting XML data from XFA form and passing to XML parser

I am trying to extract the XML information from an XFA form using VBA.
Below code works to extract the XML data to a separate file, but it requires user interaction (the user is requested to give the XML file a name). I have given up trying to automate this without user interaction due to Adobe's "safe path" requirement (which seems impossible to bypass with a VBA automation).
Dim objPDDoc As New AcroPDDoc
Dim objJSO As Object
Dim strSafePath as String
strSafePath = ""
objPDDoc.Open (FileName)
Set objJSO = objPDDoc.GetJSObject
objJSO.xfa.host.exportdata strSafePath, 0
What I would rather do is to parse the XML information directly using MSXML2.DOMDocument60. I was hoping to be able to do something like this:
Dim XMLDoc As New MSXML2.DOMDocument60
If XMLDoc.Load(objJSO.xfa.host.exportdata) = True Then
Call funcParse(XMLDoc)
End if
However, loading XMLDoc with objJSO.xfa.host.exportdata doesn't work, and I cannot seem to figure out which - if any - possibilities there are to pass the XML information using any xfa.host methods/properties.
Any help is welcome - also telling me this is not possible in VBA.
Try something like this:
myXMLstring = "<XML>BLA</XML>"
Dim xmlDoc As MSXML2.DOMDocument60
Set xmlDoc = New MSXML2.DOMDocument60
xmlDoc.LoadXML myXMLstring
See for a better example: See e.g. this post: https://desmondoshiwambo.wordpress.com/2012/07/03/how-to-load-xml-from-a-local-file-with-msxml2-domdocument-6-0-and-loadxml-using-vba/
Original poster here. After about a year of looking into this on-and-off, I found the solution.
After having accessed the JavaScript object through AccroPDDoc.GetJSObject, I can extract the nested XML as a string by using objJSO.xfa.this.saveXML.
This way, I don't have to first save the nested XML to file (which would require user interaction) - instead I can immediatly extract the nested XML and pass it to the parser.
Dim objPDDoc as New AcroPDDoc
Dim objJSO as Object
Dim XMLDoc As New MSXML2.DOMDocument60
ObjPDDoc.Open (Filename)
Set objJSO = objPDDoc.GetJSObject
If XMLDoc.LoadXML (objJSO.xfa.this.saveXML) = True then
ParseXML(XMLDoc)
End if

Error when scraping HTML: Object variable or With block variable not set

I copied code to get stock data from hsbc derivatives. (https://www.youtube.com/watch?v=IOzHacoP-u4)
I changed the URL (to hsbc) and that I want to find the value based on the ID, not the class name.
I changed the ID name.
I get
"Run Time Error-91:
Object variable or With block variable not set".
Sub Get_Web_Data()
Dim request As Object
Dim response As String
Dim html As New HTMLDocument
Dim website As String
Dim price As Variant
' Website to go to.
website = "https://www.hsbc-zertifikate.de/home/details#!/isin:DE000TR8S293"
' Create the object that will make the webpage request.
Set request = CreateObject("MSXML2.XMLHTTP")
' Where to go and how to go there - probably don't need to change this.
request.Open "GET", website, False
' Get fresh data.
'request.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
' Send the request for the webpage.
request.send
' Get the webpage response data into a variable.
response = StrConv(request.responseBody, vbUnicode)
' Put the webpage into an html object to make data references easier.
html.body.innerHTML = response
' Get the price from the specified element on the page.
price = html.getElementById("kursdaten20").innerText
' Output the price into a message box.
MsgBox price
End Sub
You are searching for element id kursdaten20 that does not exist on the page.
html.getElementById("kursdaten20") returns Nothing and you are accessing the innerText property with Nothing/Null reference.
When searching for element, you could add a check if the element exists:
'query the document
Set element = html.getElementById("kursdaten20")
If Not element Is Nothing Then
' Get the price from the specified element on the page.
price = element.innerText
' Output the price into a message box.
MsgBox price
Else
' no price
MsgBox "no price"
End If
I'm afraid it's more complicated than what you expected it to be.
I will assume that the info you're after is this:
Geldkurs (1 Stuck)4,01 EUR
Briefkurs (1 Stuck)4,11 EUR
These fields are not static. They are dynamically updated (I guess whenever a transaction is made) by scripts. That's why you will not find their ID's in the source code of the HTML page.
There is however a way to get the info you need by replicating the HTTP request that is being sent to the server whenever these fields are updated.
To find this request and its parameters you need to inspect the network traffic, when you load the page, using your browser's developer tools.
This request returns a (quite poorly structured IMHO) JSON response containing another JSON (!!) which contains the info you want, in HTML format(!!). Here's how the second JSON looks like:
To make things even worse, the names that you can see under state, change with each request you send.
So, firstly you need to parse the json response. Then you need to parse the json within the initial json response to get your hands on the HTML code. Then, using an HTML document object, you can easily get access to the HTML table, containing the desired information.
Here's the way to do it:
Option Explicit
Sub hsbc()
Dim req As New WinHttpRequest
Dim doc As New HTMLDocument
Dim table As HTMLTable
Dim cell As HTMLTableCell
Dim parsedJSON As Object
Dim key As Variant
Dim htmlCode As String
Dim url As String, reqBody As String, resp As String
url = "https://www.hsbc-zertifikate.de/web-htde-tip-zertifikate-main/?components=YW1wZWw6UnRQdWxsQ29tcG9uZW50KCdhbmltQ3NzLGMtaGlnaGxpZ2h0LXVwLGMtaGlnaGxpZ2h0LWRvd24sYy1oaWdobGlnaHQtY2hhbmdlZCcpO3NlYXJjaGhpbnRfbW9iaWxlOlNlYXJjaEhpbnRNb2JpbGVDb21wb25lbnQoJ3VsU2VhcmNoU21hbGwvc2VhcmNoSW5wdXRNb2JpbGUnKTtzZWFyY2hoaW50OlNlYXJjaEhpbnRDb21wb25lbnQoJ3VsU2VhcmNoRnVsbC9zZWFyY2gtaGVhZGVyJyk7aXNpbjpSZXNwb25zaXZlU25hcHNob3RDb21wb25lbnQoJ2ZhbHNlJyk%3D&pagepath=https%3A%2F%2Fwww.hsbc-zertifikate.de%2Fhome%2Fdetails%23!%2Fisin%3ADE000TR8S293&magnoliaSessionId=B22F70D76986AB6BACDF110E4E7A724C.public7a&v-1566551332455"
reqBody = "v-browserDetails=1&theme=hsbc&v-appId=myApp&v-sh=1080&v-sw=1920&v-cw=1920&v-ch=550&v-curdate=1566551332455&v-tzo=-180&v-dstd=60&v-rtzo=-120&v-dston=true&v-vw=50&v-vh=50&v-loc=https%3A%2F%2Fwww.hsbc-zertifikate.de%2Fhome%2Fdetails%23!%2Fisin%3ADE000TR8S293&v-wn=myApp-0.5436432044490654"
With req
.Open "POST", url, False
.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
.send reqBody
resp = .responseText
End With
Set parsedJSON = JsonConverter.ParseJson(resp)
Set parsedJSON = JsonConverter.ParseJson(parsedJSON("uidl"))
For Each key In parsedJSON("state").Keys
If parsedJSON("state")(key)("contentMode") = "HTML" Then
htmlCode = htmlCode & parsedJSON("state")(key)("text")
End If
Next key
doc.body.innerHTML = htmlCode
Set table = doc.getElementsByTagName("table")(0)
Debug.Print table.Rows(2).innerText
Debug.Print table.Rows(3).innerText
End Sub
For demonstration purposes the result will be printed in your immediate window.
You will need to add the following references to your project (VBE>Tools>References):
Microsoft WinHTTP Services version 5.1
Microsoft HTML Objects Library
Microsoft Scripting Runtime
You will also need to add this JSON parser to your project. Follow the installation instructions in the link and you should be set to go.

Resources