Hopefully a really easy question, but this is something I always seem to run into when web scraping.
I'm web scraping a database containing many chemical dossiers, some of which have a separate section for toxicological information and some of which do not. In this example the URL is fixed, as I know this dossier does contain a link to the toxicological info, so it is the page I will pull the "Sub" URL from.
I wish to check whether the website has this info by pulling this URL and, if it does not, run conditional code that gives a message saying there is no tox info.
Inspecting the page:
<li id="SubNav7_1" class="active"> Toxicological Summary </li>
I have navigated correctly to SubNav7, but I get run-time error 13 (type mismatch) when trying to get the URL.
Public Sub GetContents()
'Start ECHA Search via XML HTTP Request
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/15460", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
'GetLink
Set link = HTMLDoc.getElementById("SubNav7_1").getAttribute("href")
Debug.Print link
End Sub
The expected output is https://echa.europa.eu/registration-dossier/-/registered-dossier/15460/7/1
If anyone could point out how I can get to the href attribute of the a tag under SubNav7_1, that'd be great.
When you print the whole page, you will notice that the href attribute you're looking for is not on the SubNav7_1 element. It's on an a element nested inside it:
<li id="SubNav7_1">
Toxicological Summary
</li>
Therefore, you're getting an error accessing "href" attribute of the "li" element, because such an attribute does not exist.
If you're wondering, here's how I modified your code to see what's going on in the site you're scraping (and how I got the HTML shown above):
'Start ECHA Search via XML HTTP Request
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/15460", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
'GetLink
Dim link As String
Debug.Print Mid(HTMLDoc.body.outerHTML, InStr(1, HTMLDoc.body.outerHTML, "SubNav7_1"), 150)
link = HTMLDoc.getElementById("SubNav7_1").getAttribute("href")
Debug.Print link
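A minimal sketch of one way to read the link from the nested a element, with a check for dossiers that lack the section. It assumes the same two library references as the code above; the Sub name `GetToxLink` and the message texts are illustrative and not from the original post.

```vba
Public Sub GetToxLink()
    ' Assumes references to Microsoft XML, v6.0 and Microsoft HTML Object Library
    Dim XMLReq As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    Dim item As Object      ' the <li id="SubNav7_1"> element, if present
    Dim anchors As Object   ' the a elements nested inside it

    XMLReq.Open "GET", "https://echa.europa.eu/registration-dossier/-/registered-dossier/15460", False
    XMLReq.send
    HTMLDoc.body.innerHTML = XMLReq.responseText

    ' getElementById returns Nothing when the id is absent from the page
    Set item = HTMLDoc.getElementById("SubNav7_1")
    If item Is Nothing Then
        Debug.Print "No toxicological information section found"
        Exit Sub
    End If

    ' The href lives on the a element inside the li, not on the li itself
    Set anchors = item.getElementsByTagName("a")
    If anchors.Length = 0 Then
        Debug.Print "SubNav7_1 exists but contains no link"
    Else
        ' A String assignment is not used with Set; print the value directly
        Debug.Print anchors(0).getAttribute("href")
    End If
End Sub
```

Depending on how MSHTML resolves the link, the printed value may be relative or carry an `about:` prefix, so it may need to be joined to `https://echa.europa.eu` before use.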
Long-time listener, first-time poster.
I am hoping to get some help scraping the href attribute from a website using google chrome. I have searched and tried for hours and for the life of me cannot get the code to work.
This is the website: https://pool.pm/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4/%409e9e948d
This is a snippet of HTML code that I am trying to retrieve.
One of the things I noticed is that "topics" comes back empty and does not pull what I need, which makes the rest of my code irrelevant. I am sure I am missing something fundamental, but I cannot find it. Any help would be greatly appreciated.
My code is currently as follows:
Option Explicit
Sub openurl()
Dim myurl As String
Dim request As Object
Dim response As String
Dim html As New HTMLDocument
Dim topics As Object
Dim titleElem As Object
myurl = "https://pool.pm/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4/%409e9e948d"
Set request = CreateObject("MSXML2.XMLHTTP")
request.Open "GET", myurl, False
request.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
request.send
response = StrConv(request.responseBody, vbUnicode)
html.body.innerHTML = response
Set topics = html.getElementsByClassName("hc ah cx s e wc ccx ccy lnk")
Sheets("main").Range("A3").Value = topics.getElementsByTagName("a").href
End Sub
The site does not generate its HTML without JavaScript, so you are requesting the wrong URL.
This currently has nothing to do with Chrome. You're making a simple HTTP request.
This is the async resource that has the item you want:
https://pool.pm/wallet/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4
It returns a JSON response; one of its fields is an array of tokens.
You have to parse the JSON response with VBA.
Entry 207 has the item you want.
You could probably just loop the tokens and access .name and .policy to generate all the href links you want.
{
...
tokens[207].policy = "9e9e948d01bc64e29c26fbf85922d8d80dbf987222ffb45a6fe9f480",
tokens[207].name = "DungeonLootersClubWeapon0112"
}
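As a rough sketch of looping over the tokens, the code below pulls every `name` value out of the response with a regular expression. This is a stand-in for a proper JSON parser, not a replacement for one: it assumes the values are plain strings without escaped quotes, and the names `JsonStringValues` and `ListTokenNames` are made up for illustration. The same function could be called with "policy" to collect the policy ids.

```vba
' Collects every string value stored under the given key, e.g. "name".
' Regex stand-in for a JSON parser; assumes values contain no escaped quotes.
Public Function JsonStringValues(ByVal jsonText As String, ByVal key As String) As Collection
    Dim re As Object, m As Object
    Dim result As New Collection

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    ' Resulting pattern: "key"\s*:\s*"([^"]*)"
    re.Pattern = """" & key & """\s*:\s*""([^""]*)"""

    For Each m In re.Execute(jsonText)
        result.Add m.SubMatches(0)
    Next m
    Set JsonStringValues = result
End Function

Public Sub ListTokenNames()
    Dim request As Object
    Dim tokenName As Variant

    Set request = CreateObject("MSXML2.XMLHTTP")
    request.Open "GET", "https://pool.pm/wallet/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4", False
    request.send

    ' Print each token name found in the JSON text
    For Each tokenName In JsonStringValues(request.responseText, "name")
        Debug.Print tokenName
    Next tokenName
End Sub
```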
I'm having trouble trying to retrieve the IUPAC name of a chemical on the following page:
https://echa.europa.eu/brief-profile/-/briefprofile/100.000.685
I'd simply like the printed result to return as Benzene in this example.
The code below pulls the first of the elements with className `col-sm-8`:
Public Sub GetContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.000.685", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
Set IUPACName = HTMLDoc.getElementsByClassName("col-sm-8")(0)
Debug.Print IUPACName.innerText
End Sub
This returns:
EC / List name: IUPAC name: benzene Substance names and other identifiers
Inspecting the page there doesn't seem to be any obvious identifier to just return Benzene. Wondering how people would go about this.
Here is an image of the Text I wish to pull.
I can't test on Office versions other than 2019, but there, at least, you can use an attribute selector as follows:
Set IUPACName = HTMLDoc.querySelector("[title*=IUPAC]")
Debug.Print IUPACName.innerText
I was expecting to use:
Debug.Print IUPACName.NextSibling.NodeValue
So the latter may be what you need on your Office version.
The world of mshtml.dll is quite topsy-turvy at the moment.
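Since the behaviour seems to differ between Office versions, a small sketch that prints both candidates side by side may help decide which one applies on a given machine. The Sub name `PrintIUPACName` and the output labels are illustrative; the selector and the two properties are the ones discussed above.

```vba
Public Sub PrintIUPACName()
    ' Assumes references to Microsoft XML, v6.0 and Microsoft HTML Object Library
    Dim XMLReq As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    Dim node As Object

    XMLReq.Open "GET", "https://echa.europa.eu/brief-profile/-/briefprofile/100.000.685", False
    XMLReq.send
    HTMLDoc.body.innerHTML = XMLReq.responseText

    ' First element whose title attribute contains "IUPAC"
    Set node = HTMLDoc.querySelector("[title*=IUPAC]")
    If node Is Nothing Then
        Debug.Print "No element with IUPAC in its title attribute"
        Exit Sub
    End If

    ' Candidate 1: the text of the matched element itself
    Debug.Print "innerText: " & node.innerText

    ' Candidate 2: the text node that follows it, if there is one.
    ' NodeValue is Null for element nodes; & "" turns that into an empty string.
    If Not node.NextSibling Is Nothing Then
        Debug.Print "next sibling: " & Trim$(node.NextSibling.NodeValue & "")
    End If
End Sub
```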
I have a list of Twitter URLs in column A, from which I am trying to pull some information, but I am having a lot of trouble. I want to pull everything highlighted in yellow.
I am not sure if it is due to having the wrong classes or due to the Twitter URLs NOT opening in Excel. If I double-click a URL in Excel and try to open it, I get this error message.
The links work fine when I copy and paste them into the browser. I have read on the web that a registry key (HKEY) on the PC may need changing. The problem is that the person I am building this for is not PC literate and will struggle to apply any fix.
I have always used the code below for scraping and it has never failed me. When it tries to pull data off Twitter, I get an error message (see columns D and E in the image below). I assume it is making some contact with Twitter but cannot access the page to extract the data. I am NOT using IE, as it no longer works with Twitter; I am using MSXML2.ServerXMLHTTP.
This is what I am using to extract the data. It is the same for all the columns; only the class changes, and whether it is a span or a child.
''''Element 3 Column D
If doc.getElementsByClassName("css-1dbjc4n")(0) Is Nothing Then
wsSheet.Cells(StartRow + myCounter, 4).Value = "-"
Else
wsSheet.Cells(StartRow + myCounter, 4).Value = doc.getElementsByClassName("css-1dbjc4n")(0).getElementsByTagName("Span")(0).innerText
End If
Public Function NewHTMLDocument(strURL As String) As Object
Dim objHTTP As Object, objHTML As Object, strTemp As String
Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
objHTTP.setOption(2) = 13056
objHTTP.Open "GET", strURL, False
objHTTP.send
If objHTTP.Status = 200 Then
strTemp = objHTTP.responseText
Set objHTML = CreateObject("htmlfile")
objHTML.body.innerHTML = strTemp
Set NewHTMLDocument = objHTML
Else
'There has been an error
End If
End Function
QUESTION
Is the problem due to the urls not opening in excel, or is it because the data is dynamic and it can not be extracted?
Twitter Link 1
Twitter Link 2
As always, thanks for having a look, and my apologies in advance for NOT adding an HTML snippet. The site would not let me post, stating that a URL had been shortened. I could not find it, so I removed the whole HTML snippet in order to post.
UPDATE
I thought this link was in my post, but I must have removed it when I removed the HTML snippet. I found this on Stack Overflow but could not get it to work for me; nothing would extract.
I am trying to download a table from a company website. I can download the first page, but I cannot jump to the second page.
HTML CODE for Page Number
1
HTML CODE
[Image: HTML code for table]
The page numbers are inside the table and increase one by one. Initially, when page one is active, its link has no href and appears as
<span>1</span>
I use the code below to click the page links, but I cannot get it to work.
Set doc = ie.document
i = 0
For Each link In doc.Links
'doing downloading stuff here
i = i + 1
link.innerText = "javascript:__doPostBack('ctl00$View$gv','Page$" & i
link.Click
Next
When I check the page, there is also a JavaScript function.
JavaScript CODE
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
theForm.submit();
}
}
//]]>
After the first page is downloaded, the macro clicks irrelevant page links, and it does not even click the same page each time.
Extra Question
Also, is there any way to get href values instead of innerText in the code below?
User Name
Thanks
Open any page via a URL parameter:
Check whether you can open any page directly through a URL parameter for the page number, like this:
https://yourUrl.com?page=2
Then walking through all the pages is very easy. The only thing you must determine first is either the number of pages, or some HTML that appears only when you request a page that does not exist.
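A minimal sketch of such a walk, assuming the site really does accept a page parameter. The base URL is the placeholder from above, and the stop conditions (a non-200 status, or a page without any table element) as well as the `MAX_PAGES` cap are assumptions to adapt to the real site.

```vba
Public Sub WalkPages()
    Const BASE_URL As String = "https://yourUrl.com?page="  ' placeholder URL
    Const MAX_PAGES As Long = 50                             ' safety cap (assumption)
    Dim req As Object, doc As Object
    Dim pageNo As Long

    For pageNo = 1 To MAX_PAGES
        Set req = CreateObject("MSXML2.XMLHTTP")
        req.Open "GET", BASE_URL & pageNo, False
        req.send

        ' Stop when the server no longer answers with a normal page
        If req.Status <> 200 Then Exit For

        Set doc = CreateObject("htmlfile")
        doc.body.innerHTML = req.responseText

        ' Stop when the page no longer contains a table (assumed end marker)
        If doc.getElementsByTagName("table").Length = 0 Then Exit For

        ' Replace this line with the code that downloads the table
        Debug.Print "Page " & pageNo & ": " & doc.getElementsByTagName("tr").Length & " rows"
    Next pageNo
End Sub
```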
How to get href
You can't click innerText; it is only a string. You ask for a way to get the href, and that is the right idea. If you want to get the href of the first a tag, you can use this:
'Part of your code to open the page
'...
Dim nodeFirstLink As Object
Set nodeFirstLink = doc.getElementsByTagName("a")(0)
Debug.Print nodeFirstLink.href
'More of your code
'...
Here is an example of how to change the href.
I don't know whether this also works with JS links:
Sub ChangeHref()
Dim htmlDoc As Object
Dim nodeFirstLink As Object
'Set a short HTML Document for this example
Set htmlDoc = CreateObject("HtmlFile")
htmlDoc.body.innerHTML = "<a href='https://amazon.com'>Amazon</a>"
Set nodeFirstLink = htmlDoc.getElementsByTagName("a")(0) 'Get the first Link
Debug.Print nodeFirstLink.outerhtml 'The HTML of the first link in the html document
Debug.Print nodeFirstLink.href 'Only the href of the first link in the html document
nodeFirstLink.href = "https://ebay.com" 'Changing the href in the first link
Debug.Print nodeFirstLink.outerhtml 'The innertext is still Amazon
Debug.Print nodeFirstLink.href 'The href is the new one
End Sub
I usually use the code below to scrape websites.
Dim html As New HTMLDocument
html.body.innerHTML = HttpReq.responseText
With this code, html.getElementsByTagName("title")(0).innerText returns an empty string. Is there any way to get the page title using the DOM?
Thanks in advance
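One possible workaround, sketched under the assumption that the title is lost because only body content survives the assignment to body.innerHTML: read the title straight from the response text instead of the DOM. The function name `ExtractTitle` is made up for illustration, and HTML entities in the title are not decoded.

```vba
' Returns the text between <title> and </title>, or "" if there is none.
Public Function ExtractTitle(ByVal htmlText As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    ' [\s\S]*? matches across line breaks and stops at the first closing tag
    re.Pattern = "<title[^>]*>([\s\S]*?)</title>"

    Set matches = re.Execute(htmlText)
    If matches.Count > 0 Then ExtractTitle = Trim$(matches(0).SubMatches(0))
End Function
```

It would be called on the raw response, for example `Debug.Print ExtractTitle(HttpReq.responseText)`.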