VBA: Extract page title via HttpRequest

I usually use the code below to scrape websites.
Dim html As New HTMLDocument
html.body.innerHTML = HttpReq.responseText
With this code, html.getElementsByTagName("title")(0).innerText returns an empty string. Is there any way to get the page title using the DOM?
Thanks in advance
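
A probable reason for the empty result is that the response is assigned to body.innerHTML, so the title element from the page's head never ends up in the parsed document. As a minimal sketch that sidesteps the DOM, the title can be cut out of the raw response text instead. The function name ExtractTitle and the sample markup are hypothetical, and HTML entities in the title are not decoded.

```vba
Option Explicit

' Returns the text between <title ...> and </title>, or "" when no title is found.
' Hypothetical helper, not part of any library.
Function ExtractTitle(ByVal markup As String) As String
    Dim openPos As Long, textStart As Long, closePos As Long

    ' Find the opening tag, ignoring case.
    openPos = InStr(1, markup, "<title", vbTextCompare)
    If openPos = 0 Then Exit Function

    ' Skip to the end of the opening tag.
    textStart = InStr(openPos, markup, ">")
    If textStart = 0 Then Exit Function
    textStart = textStart + 1

    ' Find the closing tag, ignoring case.
    closePos = InStr(textStart, markup, "</title", vbTextCompare)
    If closePos = 0 Then Exit Function

    ExtractTitle = Trim$(Mid$(markup, textStart, closePos - textStart))
End Function

Sub DemoExtractTitle()
    ' Sample markup is an assumption for illustration only.
    Debug.Print ExtractTitle("<html><head><title>Example page</title></head><body></body></html>")
End Sub
```

In the original code this would be called as ExtractTitle(HttpReq.responseText).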

Related

Scraping "href" using Chrome

Long-time listener, first-time poster.
I am hoping to get some help scraping the href attribute from a website using Google Chrome. I have searched and tried for hours and, for the life of me, cannot get the code to work.
This is the website: https://pool.pm/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4/%409e9e948d
This is a snippet of HTML code that I am trying to retrieve.
One of the things I noticed is that "topics" returns empty values and is not pulling what I need, which makes the rest of my code irrelevant. I am sure I am missing something fundamental, but I cannot find it. Any help would be greatly appreciated.
My code is currently as follows:
Option Explicit
Sub openurl()
Dim myurl As String
Dim request As Object
Dim response As String
Dim html As New HTMLDocument
Dim topics As Object
Dim titleElem As Object
myurl = "https://pool.pm/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4/%409e9e948d"
Set request = CreateObject("MSXML2.XMLHTTP")
request.Open "GET", myurl, False
request.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
request.send
response = StrConv(request.responseBody, vbUnicode)
html.body.innerHTML = response
Set topics = html.getElementsByClassName("hc ah cx s e wc ccx ccy lnk")
Sheets("main").Range("A3").Value = topics.getElementsByTagName("a").href
End Sub

The site does not generate its HTML without JavaScript, so you are requesting the wrong URL.
This currently has nothing to do with Chrome. You're making a simple HTTP request.
This is the async resource that has the item you want:
https://pool.pm/wallet/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4
It returns a JSON response; one of its fields is an array of tokens.
You have to parse the JSON response with VBA.
Entry 207 has the item you want.
You could probably just loop over the tokens and read .name and .policy to generate all the href links you want.
{
...
tokens[207].policy = "9e9e948d01bc64e29c26fbf85922d8d80dbf987222ffb45a6fe9f480",
tokens[207].name = "DungeonLootersClubWeapon0112"
}
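
The loop described above could look roughly like the following sketch. It assumes the VBA-JSON module (JsonConverter) has been imported and Microsoft Scripting Runtime is referenced, and that the response has a top-level "tokens" array whose entries carry "policy" and "name", as the excerpt suggests. How the two values combine into an href is not stated in the answer, so the sketch only prints them.

```vba
Option Explicit

' Sketch only: requires the VBA-JSON module (JsonConverter) in the project.
Sub ListTokens()
    Dim req As Object
    Dim parsed As Object
    Dim token As Object

    Set req = CreateObject("MSXML2.XMLHTTP")
    req.Open "GET", "https://pool.pm/wallet/" & _
        "addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4", False
    req.send

    If req.Status <> 200 Then
        Debug.Print "Request failed with status " & req.Status
        Exit Sub
    End If

    ' ParseJson returns a Dictionary for JSON objects and a Collection for arrays.
    Set parsed = JsonConverter.ParseJson(req.responseText)

    ' Assumed response shape: { "tokens": [ { "policy": "...", "name": "..." }, ... ] }
    For Each token In parsed("tokens")
        Debug.Print token("policy"), token("name")
    Next token
End Sub
```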

pulling href after getting elements

Hopefully a really easy question, but this is something I always seem to run into issues with when web scraping.
I'm scraping a database containing many chemical dossiers; some have a separate section for toxicological information and some do not. In this example the URL is fixed, because I know this dossier links to the toxicological information, so it is the one I want to pull the "sub" URL from.
I want to check whether a dossier has this information by pulling this URL and, if it does not, run conditional code that shows a message saying there is no tox info.
Inspecting the page:
<li id="SubNav7_1" class="active"> Toxicological Summary </li>
I have navigated correctly to SubNav7 but I run into a runtime error 13 when trying to get the url.
Public Sub GetContents()
'Start ECHA Search via XML HTTP Request
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/15460", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
'GetLink
Set link = HTMLDoc.getElementById("SubNav7_1").getAttribute("href")
Debug.Print link
End Sub
The expected output is https://echa.europa.eu/registration-dossier/-/registered-dossier/15460/7/1
If anyone could point out how I can get to the attribute of the a tag under SubNav7_1, that would be great.

When you print the whole page, you will notice that the href attribute you're looking for is not on the SubNav7_1 element. It's on an a element nested inside it:
<li id="SubNav7_1">
Toxicological Summary
</li>
Therefore, you're getting an error accessing "href" attribute of the "li" element, because such an attribute does not exist.
If you're wondering, here's how I modified your code to see what's going on in the site you're scraping (and how I got the HTML shown above):
'Start ECHA Search via XML HTTP Request
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/15460", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
'GetLink
Dim link As String
Debug.Print Mid(HTMLDoc.body.outerHTML, InStr(1, HTMLDoc.body.outerHTML, "SubNav7_1"), 150)
link = HTMLDoc.getElementById("SubNav7_1").getAttribute("href")
Debug.Print link
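
Building on that explanation, one way to reach the link is to look up the a element nested in the li and read its href, checking for Nothing first so that dossiers without the section can be reported. This is a minimal sketch: the inline markup is a hypothetical stand-in for the real page (only the id and the expected URL come from the question), and link is held as a String rather than assigned with Set.

```vba
Option Explicit

Sub DemoNestedLink()
    Dim doc As Object
    Dim listItem As Object
    Dim anchors As Object
    Dim link As String

    Set doc = CreateObject("htmlfile")

    ' Hypothetical markup standing in for XMLReq.responseText.
    doc.body.innerHTML = "<ul><li id=""SubNav7_1"">" & _
        "<a href=""https://echa.europa.eu/registration-dossier/-/registered-dossier/15460/7/1"">" & _
        "Toxicological Summary</a></li></ul>"

    Set listItem = doc.getElementById("SubNav7_1")
    If listItem Is Nothing Then
        Debug.Print "No tox info section found"
        Exit Sub
    End If

    ' The href lives on the <a> inside the <li>, not on the <li> itself.
    Set anchors = listItem.getElementsByTagName("a")
    If anchors.Length = 0 Then
        Debug.Print "Section found, but it contains no link"
    Else
        link = anchors(0).href   ' a String, so no Set here
        Debug.Print link
    End If
End Sub
```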

I have a list of Twitter URLs in column A that I am trying to pull some information from, but I am having a lot of trouble. I want to pull everything highlighted in yellow.
I am not sure whether this is due to using the wrong classes or due to the Twitter URLs not opening in Excel. If I double-click a URL in Excel and try to open it, I get this error message.
The links work fine when I copy and paste them into the browser. I have read that a registry key (HKEY) on the PC may need changing (LINK). The problem is that the person I am building this for is not PC literate and would struggle to apply any fix.
I have always used the code below for scraping and it has never failed me. When it pulls data from Twitter, I get an error message (see the image below, columns D and E). I assume it is making some contact with Twitter but cannot access the page to extract the data. I am NOT using IE, as it no longer works with Twitter; I am using MSXML2.ServerXMLHTTP.
This is what I am using to extract the data. It is the same for all the columns; only the class changes, and whether it is a span or a child.
''''Element 3 Column D
If doc.getElementsByClassName("css-1dbjc4n")(0) Is Nothing Then
wsSheet.Cells(StartRow + myCounter, 4).Value = "-"
Else
wsSheet.Cells(StartRow + myCounter, 4).Value = doc.getElementsByClassName("css-1dbjc4n")(0).getElementsByTagName("Span")(0).innerText
End If
Public Function NewHTMLDocument(strURL As String) As Object
Dim objHTTP As Object, objHTML As Object, strTemp As String
Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
objHTTP.setOption(2) = 13056
objHTTP.Open "GET", strURL, False
objHTTP.send
If objHTTP.Status = 200 Then
strTemp = objHTTP.responseText
Set objHTML = CreateObject("htmlfile")
objHTML.body.innerHTML = strTemp
Set NewHTMLDocument = objHTML
Else
'There has been an error
End If
End Function
QUESTION
Is the problem due to the urls not opening in excel, or is it because the data is dynamic and it can not be extracted?
Twitter Link 1
Twitter Link 2
As always, thanks for having a look, and my apologies in advance for NOT adding an HTML snippet. The site would not let me post with it: it stated that a URL had been shortened, but I could not find that URL, so I removed the whole HTML snippet in order to post.
UPDATE
I thought this link was in my post, but I must have removed it when I removed the HTML snippet. I found this on Stack Overflow but could not get it to work for me; nothing would extract. Link

Click Java Button in URL with Excel VBA

I am trying to download a table from a company website. I can download the first page, but I cannot jump to the second page.
HTML CODE for Page Number
1
HTML CODE
[Image: HTML code for the table]
The page numbers are inside the table and increase one by one. At first, when page one is active, its link has no visible href and is shown as
<span>1</span>
I use the code below to click a page link, but I have not succeeded.
Set doc = ie.document
i = 0
For Each link In doc.Links
'doing downloading stuff here
i = i + 1
link.innerText = "javascript:__doPostBack('ctl00$View$gv','Page$" & i
link.Click
Next
When I check the page also there is a javascript function.
JavaScript CODE
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
theForm.submit();
}
}
//]]>
After the first page is downloaded, the macro clicks irrelevant page links, and it never clicks the same link from one run to the next.
Extra Question
Also, is there any way to get href values instead of innerText in the code below?
User Name
Thanks

Open any page by parameter of the url:
Check whether you can open any page directly through a URL parameter for the page number, like this:
https://yourUrl.com?page=2
If so, walking through all pages is very easy. The only thing you must work out first is the number of pages, or some HTML that appears only when you try to open a page that is not available.
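
As a rough sketch of that walk, assuming the site really does accept a page parameter: the base URL is the placeholder from above, and NOT_FOUND_MARKER is a hypothetical piece of text that would appear only on a page that does not exist. Both need replacing with values taken from the real site.

```vba
Option Explicit

Sub WalkPages()
    ' Placeholder URL from the answer; replace with the real one.
    Const BASE_URL As String = "https://yourUrl.com?page="
    ' Hypothetical text that only appears on a non-existent page.
    Const NOT_FOUND_MARKER As String = "No records found"
    ' Safety limit so the loop always ends.
    Const MAX_PAGES As Long = 100

    Dim req As Object
    Dim pageNo As Long
    Dim body As String

    Set req = CreateObject("MSXML2.XMLHTTP")

    For pageNo = 1 To MAX_PAGES
        req.Open "GET", BASE_URL & pageNo, False
        req.send

        ' Stop on an HTTP error or when the marker shows the page is missing.
        If req.Status <> 200 Then Exit For
        body = req.responseText
        If InStr(1, body, NOT_FOUND_MARKER, vbTextCompare) > 0 Then Exit For

        ' Parse the table from body here; this sketch only reports the size.
        Debug.Print "Page " & pageNo & ": " & Len(body) & " characters"
    Next pageNo
End Sub
```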
How to get href
You can't click innerText; it is only a string. Asking for a way to get the href is the right idea. If you want the href of the first a tag, you can use this:
'Part of your code to open the page
'...
Dim nodeFirstLink As Object
Set nodeFirstLink = doc.getElementsByTagName("a")(0)
Debug.Print nodeFirstLink.href
'More of your code
'...
Here is an example of how to change the href.
I don't know whether this also works with JS links, though:
Sub ChangeHref()
Dim htmlDoc As Object
Dim nodeFirstLink As Object
'Set a short HTML Document for this example
Set htmlDoc = CreateObject("HtmlFile")
htmlDoc.body.innerHTML = "<a href='https://amazon.com'>Amazon</a>"
Set nodeFirstLink = htmlDoc.getElementsByTagName("a")(0) 'Get the first Link
Debug.Print nodeFirstLink.outerhtml 'The HTML of the first link in the html document
Debug.Print nodeFirstLink.href 'Only the href of the first link in the html document
nodeFirstLink.href = "https://ebay.com" 'Changing the href in the first link
Debug.Print nodeFirstLink.outerhtml 'The innertext is still Amazon
Debug.Print nodeFirstLink.href 'The href is the new one
End Sub

Error when scraping HTML: Object variable or With block variable not set

I copied code to get stock data from hsbc derivatives. (https://www.youtube.com/watch?v=IOzHacoP-u4)
I changed the URL (to HSBC) and, because I want to find the value by ID rather than by class name, I switched the lookup to an ID.
I changed the ID name accordingly.
I get
"Run Time Error-91:
Object variable or With block variable not set".
Sub Get_Web_Data()
Dim request As Object
Dim response As String
Dim html As New HTMLDocument
Dim website As String
Dim price As Variant
' Website to go to.
website = "https://www.hsbc-zertifikate.de/home/details#!/isin:DE000TR8S293"
' Create the object that will make the webpage request.
Set request = CreateObject("MSXML2.XMLHTTP")
' Where to go and how to go there - probably don't need to change this.
request.Open "GET", website, False
' Get fresh data.
'request.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
' Send the request for the webpage.
request.send
' Get the webpage response data into a variable.
response = StrConv(request.responseBody, vbUnicode)
' Put the webpage into an html object to make data references easier.
html.body.innerHTML = response
' Get the price from the specified element on the page.
price = html.getElementById("kursdaten20").innerText
' Output the price into a message box.
MsgBox price
End Sub

You are searching for the element ID kursdaten20, which does not exist on the page.
html.getElementById("kursdaten20") returns Nothing, and you are then reading the innerText property on a Nothing reference.
When searching for an element, you can add a check that the element exists:
'query the document
Dim element As Object
Set element = html.getElementById("kursdaten20")
If Not element Is Nothing Then
' Get the price from the specified element on the page.
price = element.innerText
' Output the price into a message box.
MsgBox price
Else
' no price
MsgBox "no price"
End If

I'm afraid it's more complicated than what you expected it to be.
I will assume that the info you're after is this:
Geldkurs (1 Stuck)4,01 EUR
Briefkurs (1 Stuck)4,11 EUR
These fields are not static. They are dynamically updated by scripts (I guess whenever a transaction is made). That's why you will not find their IDs in the source code of the HTML page.
There is however a way to get the info you need by replicating the HTTP request that is being sent to the server whenever these fields are updated.
To find this request and its parameters you need to inspect the network traffic, when you load the page, using your browser's developer tools.
This request returns a (quite poorly structured, IMHO) JSON response containing another JSON (!!), which in turn contains the info you want in HTML format (!!). Here's what the second JSON looks like:
To make things even worse, the names that you can see under state change with each request you send.
So, first you need to parse the JSON response. Then you need to parse the JSON nested within that response to get your hands on the HTML code. Finally, using an HTML document object, you can easily access the HTML table containing the desired information.
Here's the way to do it:
Option Explicit
Sub hsbc()
Dim req As New WinHttpRequest
Dim doc As New HTMLDocument
Dim table As HTMLTable
Dim cell As HTMLTableCell
Dim parsedJSON As Object
Dim key As Variant
Dim htmlCode As String
Dim url As String, reqBody As String, resp As String
url = "https://www.hsbc-zertifikate.de/web-htde-tip-zertifikate-main/?components=YW1wZWw6UnRQdWxsQ29tcG9uZW50KCdhbmltQ3NzLGMtaGlnaGxpZ2h0LXVwLGMtaGlnaGxpZ2h0LWRvd24sYy1oaWdobGlnaHQtY2hhbmdlZCcpO3NlYXJjaGhpbnRfbW9iaWxlOlNlYXJjaEhpbnRNb2JpbGVDb21wb25lbnQoJ3VsU2VhcmNoU21hbGwvc2VhcmNoSW5wdXRNb2JpbGUnKTtzZWFyY2hoaW50OlNlYXJjaEhpbnRDb21wb25lbnQoJ3VsU2VhcmNoRnVsbC9zZWFyY2gtaGVhZGVyJyk7aXNpbjpSZXNwb25zaXZlU25hcHNob3RDb21wb25lbnQoJ2ZhbHNlJyk%3D&pagepath=https%3A%2F%2Fwww.hsbc-zertifikate.de%2Fhome%2Fdetails%23!%2Fisin%3ADE000TR8S293&magnoliaSessionId=B22F70D76986AB6BACDF110E4E7A724C.public7a&v-1566551332455"
reqBody = "v-browserDetails=1&theme=hsbc&v-appId=myApp&v-sh=1080&v-sw=1920&v-cw=1920&v-ch=550&v-curdate=1566551332455&v-tzo=-180&v-dstd=60&v-rtzo=-120&v-dston=true&v-vw=50&v-vh=50&v-loc=https%3A%2F%2Fwww.hsbc-zertifikate.de%2Fhome%2Fdetails%23!%2Fisin%3ADE000TR8S293&v-wn=myApp-0.5436432044490654"
With req
.Open "POST", url, False
.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
.send reqBody
resp = .responseText
End With
Set parsedJSON = JsonConverter.ParseJson(resp)
Set parsedJSON = JsonConverter.ParseJson(parsedJSON("uidl"))
For Each key In parsedJSON("state").Keys
If parsedJSON("state")(key)("contentMode") = "HTML" Then
htmlCode = htmlCode & parsedJSON("state")(key)("text")
End If
Next key
doc.body.innerHTML = htmlCode
Set table = doc.getElementsByTagName("table")(0)
Debug.Print table.Rows(2).innerText
Debug.Print table.Rows(3).innerText
End Sub
For demonstration purposes the result will be printed in your immediate window.
You will need to add the following references to your project (VBE>Tools>References):
Microsoft WinHTTP Services version 5.1
Microsoft HTML Objects Library
Microsoft Scripting Runtime
You will also need to add this JSON parser (the module that provides JsonConverter.ParseJson, used above) to your project. Follow the installation instructions in the link and you should be good to go.
