Scraping "href" using Chrome


Long-time listener, first-time poster.
I am hoping to get some help scraping the href attribute from a website using Google Chrome. I have searched and tried for hours, and for the life of me I cannot get the code to work.
This is the website: https://pool.pm/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4/%409e9e948d
This is a snippet of HTML code that I am trying to retrieve.
One of the things I noticed is that "topics" returns empty values and is not pulling what I need it to, which makes the rest of my code irrelevant. I am sure I am missing something fundamental, but I cannot find it. Any help would be greatly appreciated.
My code is currently as follows:
Option Explicit
Sub openurl()
Dim myurl As String
Dim request As Object
Dim response As String
Dim html As New HTMLDocument
Dim topics As Object
Dim titleElem As Object
myurl = "https://pool.pm/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4/%409e9e948d"
Set request = CreateObject("MSXML2.XMLHTTP")
request.Open "GET", myurl, False
request.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
request.send
response = StrConv(request.responseBody, vbUnicode)
html.body.innerHTML = response
Set topics = html.getElementsByClassName("hc ah cx s e wc ccx ccy lnk")
Sheets("main").Range("A3").Value = topics.getElementsByTagName("a").href
End Sub

The site does not generate this HTML without JavaScript, so the URL you are requesting never contains the element you are looking for.
This currently has nothing to do with Chrome either. You're making a plain HTTP request.
This is the async resource that has the item you want:
https://pool.pm/wallet/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4
It returns a JSON response; one of its fields is an array of tokens.
You have to parse the JSON response with VBA.
Entry 207 has the item you want.
You could probably just loop over the tokens and read .name and .policy to generate all the href links you want.
{
...
tokens[207].policy = "9e9e948d01bc64e29c26fbf85922d8d80dbf987222ffb45a6fe9f480",
tokens[207].name = "DungeonLootersClubWeapon0112"
}
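The loop over the tokens could be sketched roughly as follows. This is a minimal sketch, not a tested solution: it assumes the VBA-JSON module (JsonConverter) has been imported into the project, that the top-level field is called "tokens" as shown above, and that a link can be built as https://pool.pm/<policy>.<name>. That link pattern is an assumption for illustration and should be checked against the hrefs on the rendered page. ListTokenLinks is a hypothetical name.

```vba
Option Explicit

' Sketch only. Requires the VBA-JSON module (JsonConverter) in the project
' and a reference to Microsoft Scripting Runtime.
Sub ListTokenLinks()
    Dim req As Object
    Dim parsed As Object
    Dim token As Object
    Dim r As Long

    Set req = CreateObject("MSXML2.XMLHTTP")
    req.Open "GET", "https://pool.pm/wallet/addr1qxlxmpqamdnzs9gpgvjnsxehu4pd95a9ddhhcuxadvzv69jjtu4lhppapqxxgtsxweackk6se5m3zp9qkadsu62de8uqrp3dk4", False
    req.send

    If req.Status <> 200 Then
        Debug.Print "Request failed with status " & req.Status
        Exit Sub
    End If

    ' JSON objects become Dictionaries, JSON arrays become Collections
    Set parsed = JsonConverter.ParseJson(req.responseText)

    r = 3
    For Each token In parsed("tokens")
        ' ASSUMPTION: the link pattern below is illustrative, not confirmed
        Sheets("main").Cells(r, 1).Value = "https://pool.pm/" & token("policy") & "." & token("name")
        r = r + 1
    Next token
End Sub
```

Writing one link per row starting at A3 mirrors the cell used in the question; adjust the target range as needed.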

Related

pulling href after getting elements

Hopefully a really easy question, but this is something I always seem to run into issues with when web scraping.
I'm web scraping from a database containing many chemical dossiers, some of which have a separate section for toxicological information and some of which do not. In this example the URL is fixed, because I know this dossier does contain a link to the toxicological info, so there is a "Sub" URL to pull.
I wish to check whether the page has this info by pulling this URL, and if not, run conditional code that gives a message saying there is no tox info, etc.
Inspecting the page:
<li id="SubNav7_1" class="active"> Toxicological Summary </li>
I have navigated correctly to SubNav7 but I run into a runtime error 13 when trying to get the url.
Public Sub GetContents()
'Start ECHA Search via XML HTTP Request
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/15460", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
'GetLink
Set link = HTMLDoc.getElementById("SubNav7_1").getAttribute("href")
Debug.Print link
End Sub
The expected output is https://echa.europa.eu/registration-dossier/-/registered-dossier/15460/7/1
If anyone could point out how I can get to the attribute of the a tag under SubNav7_1, that'd be great.

When you print the whole page, you will notice that the href attribute you're looking for is not on the SubNav7_1 element. It's on an a element nested inside it:
<li id="SubNav7_1">
<a href="https://echa.europa.eu/registration-dossier/-/registered-dossier/15460/7/1">Toxicological Summary</a>
</li>
Therefore, you're getting an error when accessing the "href" attribute of the "li" element, because no such attribute exists on it.
If you're wondering, here's how I modified your code to see what's going on in the site you're scraping (and how I got the HTML shown above):
'Start ECHA Search via XML HTTP Request
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/registration-dossier/-/registered-dossier/15460", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
'GetLink
Dim link As String
Debug.Print Mid(HTMLDoc.body.outerHTML, InStr(1, HTMLDoc.body.outerHTML, "SubNav7_1"), 150)
link = HTMLDoc.getElementById("SubNav7_1").getAttribute("href")
Debug.Print link
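Building on that, reading the href from the nested a element might look like the sketch below. It assumes references to Microsoft XML v6.0 and Microsoft HTML Object Library (as in the question), and GetToxLink is a hypothetical name. Because the document is built from innerHTML rather than loaded from the site, a relative href may come back with an "about:" prefix instead of the full address, so the printed value may need adjusting.

```vba
Public Sub GetToxLink()
    Dim XMLReq As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    Dim item As Object      ' late bound so getElementsByTagName is available
    Dim anchors As Object

    XMLReq.Open "GET", "https://echa.europa.eu/registration-dossier/-/registered-dossier/15460", False
    XMLReq.send
    HTMLDoc.body.innerHTML = XMLReq.responseText

    ' The li is missing when the dossier has no toxicological section
    Set item = HTMLDoc.getElementById("SubNav7_1")
    If item Is Nothing Then
        Debug.Print "No tox info"
        Exit Sub
    End If

    ' The href lives on the a element inside the li, not on the li itself
    Set anchors = item.getElementsByTagName("a")
    If anchors.Length = 0 Then
        Debug.Print "No link inside SubNav7_1"
        Exit Sub
    End If

    Debug.Print anchors(0).getAttribute("href")
End Sub
```

The two Nothing/Length checks double as the conditional "no tox info" branch asked for in the question.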

Vba Scraping Twitter details

I have a list of Twitter URLs in column A, from which I am trying to pull some information, but I am having a lot of trouble. I want to pull everything highlighted in yellow.
I am not sure if it is due to having the wrong classes or due to the Twitter URLs NOT opening in Excel. If I double-click a URL in Excel and try to open it, I get this error message.
The links work fine when I copy and paste them into the browser. I have read on the web that an HKEY on the PC may need changing (LINK). The problem is that the person I am building this for is not PC literate and would struggle to apply any fix.
I have always used the code below for scraping and it has never failed me. When it does pull data off Twitter, I get an error message (see the image below, columns D + E). I assume it is making some contact with Twitter but cannot access the page to extract the data. I am NOT using IE, as it no longer works with Twitter; I am using MSXML2.ServerXMLHTTP.
This is what I am using to extract the data. It is the same for all the columns; only the class changes, and whether it is a Span or a child.
''''Element 3 Column D
If doc.getElementsByClassName("css-1dbjc4n")(0) Is Nothing Then
wsSheet.Cells(StartRow + myCounter, 4).Value = "-"
Else
wsSheet.Cells(StartRow + myCounter, 4).Value = doc.getElementsByClassName("css-1dbjc4n")(0).getElementsByTagName("Span")(0).innerText
End If
Public Function NewHTMLDocument(strURL As String) As Object
Dim objHTTP As Object, objHTML As Object, strTemp As String
Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
objHTTP.setOption(2) = 13056
objHTTP.Open "GET", strURL, False
objHTTP.send
If objHTTP.Status = 200 Then
strTemp = objHTTP.responseText
Set objHTML = CreateObject("htmlfile")
objHTML.body.innerHTML = strTemp
Set NewHTMLDocument = objHTML
Else
'There has been an error
End If
End Function
QUESTION
Is the problem due to the URLs not opening in Excel, or is it because the data is dynamic and cannot be extracted?
Twitter Link 1
Twitter Link 2
As always, thanks for having a look, and my apologies in advance for NOT adding an HTML snippet. The site would not let me post, stating that a URL had been shortened; I could not find it, so I removed the whole HTML snippet in order to post.
UPDATE
I thought this link was in my post, but I must have removed it when I removed the HTML snippet. I found this on Stack Overflow but could not get it to work for me; nothing would extract (Link).

Trying to get ResponseText from a GET request in Twitter

I'm trying to improve my knowledge of VBA, learning about GET, POST and so on. I've seen many examples and can't work out what I'm doing wrong. It is probably the OAuth part.
The main problem is that I'm just an Excel guy. I'm not a web developer, so my knowledge is almost nil, and I'm probably missing a lot of basic stuff.
I hope this question is not too broad.
BACKGROUND: I'm trying to get the ResponseText of a JSON object, from a tweet. The information is public and you don't need to be logged in to see the info I want to get, and you don't need a Twitter account.
For testing, I'm using this tweet: https://twitter.com/StackOverflow/status/1273391252357201922
WHAT I WANT: Checking the code with Developer Tools (I'm using Firefox), I've seen this:
This GET request returns this ResponseText:
So I would like to get that ResponseText into VBA.
MY CODE: Checking different code samples here on SO, I've built this:
Sub test()
Dim MiHttp As Object
Dim MiUrl As String
Set MiHttp = CreateObject("MSXML2.XMLHTTP")
MiUrl = "https://api.twitter.com/2/timeline/conversation/1273391252357201922.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweet=true&count=20&ext=mediaStats%2ChighlightedLabel&include_quote_count=true"
With MiHttp
.Open "GET", MiUrl
.Send
DoEvents
Debug.Print .responseText
End With
MiHttp.abort
Set MiHttp = Nothing
End Sub
It runs with no coding errors, but I get this:
{"errors":[{"code":200,"message":"Forbidden."}]}
So I tried adding a request header with authorization, by adding this line of code before .Send:
.setRequestHeader "authorization", "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA"
And then I get this in the debugger:
{"errors":[{"message":"Rate limit exceeded","code":88}]}
So I checked the Twitter developer documentation for info about Bearer tokens, and I must admit I got overwhelmed.
About Bearer
About Tokens
And now I'm lost. I thought this would be fairly easy, because it's public info that anyone can get manually from any tweet, without using any app or logging in to Twitter, but it looks like I'm wrong.
FINAL QUESTION: I would like to know if I can get that Bearer token in any way and then apply it in my code to get that JSON responseText (dealing with the JSON and learning about it would be a totally different question, out of scope here).
And I would like to achieve this with VBA only, no other apps or languages, because I have no idea about those.
Actually I'm not even interested in the full text, just the part surrounded with red line.
Looking for some help, guide, light.
Thanks in advance and I hope this question is not too broad.
Thanks!
UPDATES: Tested @ChristosLytras's answer. I get this error:
UPDATE JULY 2020: now the working url is:
https://api.twitter.com/2/timeline/conversation/1273391252357201922.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweet=true&count=20&ext=mediaStats%2ChighlightedLabel&include_quote_count=true

You have to pass a valid, freshly fetched Guest Token in the request headers along with the authorization Bearer, and then you'll get the response. The Twitter public API bearer never changes.
To get a new and valid Guest Token for each request, you can make a HEAD request using WinHttp.WinHttpRequest.5.1 instead of MSXML2.XMLHTTP and read the gt cookie using a regular expression like gt=(\d+);. That fetches the cookie headers each time it is called. You cannot use MSXML2.XMLHTTP because it caches responses, so you won't get a new Guest Token each time you request the HEAD.
Working code tested using Excel 2013 with VBA 7.1:
Dim MiHttp As Object
Dim GuestTokenRE As Object
Dim MiUrl As String
Set MiHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
Set GuestTokenRE = CreateObject("VBScript.RegExp")
MiUrl = "https://api.twitter.com/2/timeline/conversation/1273391252357201922.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweet=true&count=20&ext=mediaStats%2ChighlightedLabel&include_quote_count=true"
With MiHttp
    ' Make a HEAD request with no cache to get the Guest Token cookie
    .Open "HEAD", "https://twitter.com", False
    .setRequestHeader "User-Agent", "Firefox"
    .setRequestHeader "Pragma", "no-cache"
    .setRequestHeader "Cache-Control", "no-cache"
    .Send
    DoEvents
    ' Use a regular expression to extract guest token from response headers
    GuestTokenRE.Pattern = "Set-Cookie: gt=(\d+);"
    GuestTokenRE.IgnoreCase = True
    Dim matches As Object
    Set matches = GuestTokenRE.Execute(.getAllResponseHeaders())
    If matches.Count = 1 Then
        Dim guestToken As String
        guestToken = matches.Item(0).Submatches.Item(0)
        ' Print the Guest Token for validation
        Debug.Print "Got Guest Token", guestToken
        ' Now we have a valid guest token, make the request
        .Open "GET", MiUrl, False
        ' Authorization Bearer is always the same
        .setRequestHeader "authorization", "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA"
        .setRequestHeader "x-guest-token", guestToken
        .Send
        DoEvents
        Debug.Print "Got response", .responseText
    Else
        Debug.Print "Could not fetch Guest Token"
    End If
End With
MiHttp.abort
Set MiHttp = Nothing
Set GuestTokenRE = Nothing
Regarding the 80072efe error
You'll have to get WinHttp.WinHttpRequest.5.1 to work. The 80072efe error indicates that the connection terminated abnormally, and you can read more about it here. I didn't run into this issue, so the error does not originate from the endpoint.

Error when scraping HTML: Object variable or With block variable not set

I copied code to get stock data from HSBC derivatives (https://www.youtube.com/watch?v=IOzHacoP-u4).
I changed the URL (to HSBC) and, because I want to find the value by ID rather than by class name, I changed the lookup to use the ID name.
I get
"Run Time Error-91:
Object variable or With block variable not set".
Sub Get_Web_Data()
Dim request As Object
Dim response As String
Dim html As New HTMLDocument
Dim website As String
Dim price As Variant
' Website to go to.
website = "https://www.hsbc-zertifikate.de/home/details#!/isin:DE000TR8S293"
' Create the object that will make the webpage request.
Set request = CreateObject("MSXML2.XMLHTTP")
' Where to go and how to go there - probably don't need to change this.
request.Open "GET", website, False
' Get fresh data.
'request.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
' Send the request for the webpage.
request.send
' Get the webpage response data into a variable.
response = StrConv(request.responseBody, vbUnicode)
' Put the webpage into an html object to make data references easier.
html.body.innerHTML = response
' Get the price from the specified element on the page.
price = html.getElementById("kursdaten20").innerText
' Output the price into a message box.
MsgBox price
End Sub

You are searching for an element with id kursdaten20, which does not exist on the page.
html.getElementById("kursdaten20") returns Nothing, and you are then accessing the innerText property on a Nothing reference.
When searching for an element, you can add a check that the element exists:
' Query the document
Dim element As Object
Set element = html.getElementById("kursdaten20")
If Not element Is Nothing Then
    ' Get the price from the specified element on the page.
    price = element.innerText
    ' Output the price into a message box.
    MsgBox price
Else
    ' No price
    MsgBox "no price"
End If

I'm afraid it's more complicated than what you expected it to be.
I will assume that the info you're after is this:
Geldkurs (1 Stück) 4,01 EUR
Briefkurs (1 Stück) 4,11 EUR
These fields are not static. They are dynamically updated by scripts (I guess whenever a transaction is made). That's why you will not find their IDs in the source code of the HTML page.
There is however a way to get the info you need by replicating the HTTP request that is being sent to the server whenever these fields are updated.
To find this request and its parameters you need to inspect the network traffic, when you load the page, using your browser's developer tools.
This request returns a (quite poorly structured, IMHO) JSON response containing another JSON (!!), which contains the info you want in HTML format (!!). Here's what the second JSON looks like:
To make things even worse, the names that you can see under state change with each request you send.
So, first you need to parse the JSON response. Then you need to parse the JSON nested within that response to get your hands on the HTML code. Then, using an HTML document object, you can easily get access to the HTML table containing the desired information.
Here's the way to do it:
Option Explicit
Sub hsbc()
Dim req As New WinHttpRequest
Dim doc As New HTMLDocument
Dim table As HTMLTable
Dim cell As HTMLTableCell
Dim parsedJSON As Object
Dim key As Variant
Dim htmlCode As String
Dim url As String, reqBody As String, resp As String
url = "https://www.hsbc-zertifikate.de/web-htde-tip-zertifikate-main/?components=YW1wZWw6UnRQdWxsQ29tcG9uZW50KCdhbmltQ3NzLGMtaGlnaGxpZ2h0LXVwLGMtaGlnaGxpZ2h0LWRvd24sYy1oaWdobGlnaHQtY2hhbmdlZCcpO3NlYXJjaGhpbnRfbW9iaWxlOlNlYXJjaEhpbnRNb2JpbGVDb21wb25lbnQoJ3VsU2VhcmNoU21hbGwvc2VhcmNoSW5wdXRNb2JpbGUnKTtzZWFyY2hoaW50OlNlYXJjaEhpbnRDb21wb25lbnQoJ3VsU2VhcmNoRnVsbC9zZWFyY2gtaGVhZGVyJyk7aXNpbjpSZXNwb25zaXZlU25hcHNob3RDb21wb25lbnQoJ2ZhbHNlJyk%3D&pagepath=https%3A%2F%2Fwww.hsbc-zertifikate.de%2Fhome%2Fdetails%23!%2Fisin%3ADE000TR8S293&magnoliaSessionId=B22F70D76986AB6BACDF110E4E7A724C.public7a&v-1566551332455"
reqBody = "v-browserDetails=1&theme=hsbc&v-appId=myApp&v-sh=1080&v-sw=1920&v-cw=1920&v-ch=550&v-curdate=1566551332455&v-tzo=-180&v-dstd=60&v-rtzo=-120&v-dston=true&v-vw=50&v-vh=50&v-loc=https%3A%2F%2Fwww.hsbc-zertifikate.de%2Fhome%2Fdetails%23!%2Fisin%3ADE000TR8S293&v-wn=myApp-0.5436432044490654"
With req
    .Open "POST", url, False
    .setRequestHeader "Content-type", "application/x-www-form-urlencoded"
    .send reqBody
    resp = .responseText
End With
Set parsedJSON = JsonConverter.ParseJson(resp)
Set parsedJSON = JsonConverter.ParseJson(parsedJSON("uidl"))
For Each key In parsedJSON("state").Keys
    If parsedJSON("state")(key)("contentMode") = "HTML" Then
        htmlCode = htmlCode & parsedJSON("state")(key)("text")
    End If
Next key
doc.body.innerHTML = htmlCode
Set table = doc.getElementsByTagName("table")(0)
Debug.Print table.Rows(2).innerText
Debug.Print table.Rows(3).innerText
End Sub
For demonstration purposes the result will be printed in your immediate window.
You will need to add the following references to your project (VBE>Tools>References):
Microsoft WinHTTP Services version 5.1
Microsoft HTML Objects Library
Microsoft Scripting Runtime
You will also need to add this JSON parser to your project. Follow the installation instructions in the link and you should be set to go.

VBA: Extract page title via HttpRequest

I usually use the code below to scrape websites.
Dim html As New HTMLDocument
html.body.innerHTML = HttpReq.responseText
With this code, "html.getElementsByTagName("title")(0).innerText" returns an empty string. Is there any way to get the page title using the DOM?
Thanks in advance.
