How to get the content of the entire webpage - Excel

I have been using the code below in my VBA macro to get a webpage's content for a couple of years.
Recently a new piece of functionality was needed, based on information shown on the webpage.
I found out that I can see the information I need when I press Shift-Ctrl-I in Chrome and copy the top element, but it is not present in what I'm getting through the code.
What do I need to change in the code to get the whole page? The equivalent of Shift-Ctrl-I and Copy element.
Set Request = CreateObject("MSXML2.XMLHTTP")
Request.Open "GET", ZadanieRef, False
Request.setRequestHeader "If-Modified-Since", "Tue, 1 Jan 2019 00:00:00 GMT"
Request.send
response = StrConv(Request.responseBody, vbUnicode)

I have looked around, and this is what I found: vba Open URL in Chrome including login and password
The answer was:
Chrome does not have a library in VBA, so there is no way to manipulate Chrome like IE. You can only open a webpage directly via Shell, but not manipulate it further.
If you know otherwise please provide a link.
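For reference, a minimal sketch of the Shell approach mentioned in that answer, reusing the ZadanieRef URL variable from the code above; it only opens the page in the browser and gives VBA no access to the loaded DOM:
'Sketch only: opens ZadanieRef in the default browser via Shell; no DOM access afterwards
Shell "explorer """ & ZadanieRef & """", vbNormalFocus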

Related

VBA Webscrape URL from HTML (src="")

I tried to combine code parts I could make work; it worked with <span> and <meta> elements, but it is not working with <img>.
Can anyone help me make it work?
I am trying to get:
https://www.lego.com/cdn/cs/set/assets/blt34360a0ffaff7811/11015_alt.png?fit=bounds&format=png&width=800&height=800&dpr=1
From this code:
<img src="https://www.lego.com/cdn/cs/set/assets/blt34360a0ffaff7811/11015_alt.png?fit=bounds&format=png&width=800&height=800&dpr=1" alt="" class="Imagestyles__Img-sc-1qqdbhr-0 cajeby">
Code part where I want to get the src URL:
Sub picgrab()
    Dim Doc As Object
    Dim nodeAllPic As Object
    Dim nodeOnePic As Object
    Dim pic As Object
    Dim url As String
    Set Doc = CreateObject("htmlFile")
    With CreateObject("MSXML2.XMLHTTP.6.0")
        url = "https://www.lego.com/hu-hu/product/around-the-world-11015"
        .Open "GET", url, False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0"
        .send
        ' It is important that I can't use InternetExplorer.
        'This should work I guess, but it skips after the 'For Each' line.
        Set nodeAllPic = Doc.getElementsByClassName("Imagestyles__Img-sc-1qqdbhr-0 cajeby")
        For Each nodeOnePic In nodeAllPic
            If nodeOnePic.getAttribute("class") = "Imagestyles__Img-sc-1qqdbhr-0 cajeby" Then
                Set pic = nodeOnePic.getElementsByClassName("Imagestyles__Img-sc-1qqdbhr-0 cajeby")(0)
                ActiveCell.Value = pic.getAttribute("src")
            End If
        Next nodeOnePic
    End With
End Sub
I tried the code above and modified it in many ways, but couldn't get the content of src="".
Need to write the response
First of all, you never write the HTML response to your htmlfile object. So you won't be able to find anything when you call the method getElementsByClassName on it.
Make sure that you include the following line before trying to use the Doc object:
Doc.Write .responseText
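For illustration, here is a minimal sketch of where that line fits in your existing With block (on its own this still won't make the image findable, for the reason explained next):
With CreateObject("MSXML2.XMLHTTP.6.0")
    .Open "GET", url, False
    .send
    Doc.Write .responseText 'write the response into the htmlfile object before querying it
End With
'Only after this can Doc.getElementsByClassName(...) return anything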
Dynamic Content
Secondly, some of the content on that page is not in the original HTTP response that XMLHTTP receives. The page contains JavaScript code that loads content dynamically.
To test this in Chrome, you can open the Chrome DevTools window on that page, then disable JavaScript and refresh the page.
You'll then see the original HTML and a notification that says that JavaScript is disabled.
And now, if you search inside the Elements tab, you won't find the element you were looking for (at least I couldn't find anything with a class "cajeby").
Browser emulation
So, now what? Well, you'll need something that can take the original response and actually execute the JavaScript code. For that you could use Selenium. It's the modern way of doing web scraping or any other browser automation with VBA.
You can easily find tutorials on how to get started with Selenium for VBA, but I would recommend this video by WiseOwlTutorials.
Then your code could look like this:
Dim Browser As New Selenium.WebDriver
Browser.Start "chrome", "https://www.lego.com/hu-hu/product/around-the-world-11015"
Browser.Get "/"
Dim img As WebElement
Set img = Browser.FindElementByCss(".Imagestyles__Img-sc-1qqdbhr-0.cajeby", timeout:=5000)
Debug.Print img.Attribute("src")
Set Browser = Nothing
Some notes on the code
Make sure that you have included a reference to the Selenium Library
Notice the use of FindElementByCss. This is necessary because the element is identified by 2 class names and no other Find method currently supports that; with the CSS selector syntax you can chain them as .class1.class2. (More about this here.)
Notice the use of timeout:=5000, which tells Selenium that you are willing to wait up to 5000 milliseconds for the JavaScript code to load the content you are looking for (more details here).

Replacing IE Bits with Edge in VBA

To prepare for the eventual 'going away' of IE11, I've been trying to figure out how to replace a couple parts of my code. One involves launching IE and using that browser to scrape some pages. Is there an equivalent way to do the below in Edge? I don't see a way to add a reference to the Edge libraries like I did with 'Microsoft Internet Objects' and IE11.
Dim ie As InternetExplorerMedium: Set ie = New InternetExplorerMedium
Dim html As HTMLDocument
With ie
    .Visible = False
    .Navigate website 'string that's created above this code
End With
Do While ie.ReadyState <> READYSTATE_COMPLETE
    DoEvents
Loop
Application.Wait Now + #12:00:10 AM#
Set html = ie.Document
Thanks everyone for your help.
Ok, a few explanations. I am writing these as a reply so as not to have to split them into several comments.
Does Edge work instead of IE to do web scraping with VBA?
It does not work directly. The reason is that IE has a COM interface (Wikipedia: Component Object Model). No other browser has this interface. Not even Edge.
But for Edge there is also a web driver for Selenium. Even provided directly by MS.
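As a rough sketch only (assuming SeleniumBasic is installed with a matching Edge driver executable in its folder; the URL is a placeholder), the pattern would look something like this:
'Rough sketch - assumes SeleniumBasic plus a matching Edge WebDriver executable
Dim d As Selenium.EdgeDriver
Set d = New Selenium.EdgeDriver
With d
    .Start "edge"
    .Get "https://www.example.com" 'placeholder URL
    Debug.Print .FindElementByTag("body").Text
    .Quit
End With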
Another alternative - xhr
Since you can't use Selenium because you don't have admin rights, there might be the possibility of using xhr (XML HTTP Request). However, in order to say anything definite about this, we would have to know the page that you want to scrape.
Xhr can be used directly from VBA because it does not use a browser. The big limitation is that only static content can be processed. No JavaScript is executed, so nothing is reloaded or generated dynamically in any other way. On the other hand, this option is much faster than browser solutions. Often, a static file provided by the web server is sufficient. This can be an HTML file, a JSON or another data exchange format.
There are many examples of using xhr with VBA here on SO. For now, just take note of it as another possible approach. I can't explain the method exhaustively here, partly because I don't know everything about it myself, but there are many ways to use it.
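As a rough illustration (the URL is a placeholder), fetching a static page with xhr looks like this:
'Minimal xhr sketch: only static HTML is returned, no JavaScript is executed
Dim xhr As Object
Set xhr = CreateObject("MSXML2.XMLHTTP.6.0")
xhr.Open "GET", "https://www.example.com", False 'placeholder URL
xhr.send
Dim html As String
html = xhr.responseText
Debug.Print html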
By the way
IE will finally be discontinued in June 2022 and will then also no longer be delivered with Windows. That's what I read on the German IT pages a few days ago. But there are already massive restrictions on the use of IE.

Use VBA to open URL in Default-Browser and catch existing Session

I am trying to open a specific URL of a web application that I'm already logged in to (or that tells me to log in if I'm not) in the default browser (Chrome). When I copy/paste this URL into the browser's address bar, it works perfectly. It doesn't when I open this URL from VBA with ThisWorkbook.FollowHyperlink - then it redirects, as a kind of fallback, to the homepage instead of the specific URL.
I found out that this is a session problem and that VBA somehow doesn't recognize/catch the existing session.
As an "ugly workaround" I'm currently redirecting over http://www.dereferer.org/ to the specific URL, which works perfectly but takes additional time.
This doesn't work:
ThisWorkbook.FollowHyperlink ("https://www.example.com/function/edit/2019-04-09)
This works:
ThisWorkbook.FollowHyperlink ("http://www.dereferer.org/?https://www.example.com/function/edit/2019-04-09)
(for my needs it's not required to encode the target URL)
As this redirect is slow and indirect, I'm searching for a way to directly open the target URL while using the existing session (if possible). If this isn't possible (for example because of security), what's the best/fastest way to redirect without setting up my own redirector (which, like dereferer.org, redirects via a GET parameter)?
A clunky and ill-advised workaround, but you could bypass FollowHyperlink, and instead use Shell to open the website in a new tab/window of your default web-browser:
Shell "explorer ""https://www.example.com/function/edit/2019-04-09"""
(As a note, if you typed the hyperlink into a cell and clicked on it manually, instead of using VBA's FollowHyperlink, the same issue would still occur. This also happens in Word and PowerPoint. Just be thankful you're not trying to catch the FollowHyperlink event and "correct" it in the window.)
In response to comments - for Mac you will need to use "open" instead of "explorer". This code should run on both Mac and PC:
Shell IIf(Left(Application.OperatingSystem, 3) = "Win", "explorer ", "open ") & _
    """https://www.example.com/function/edit/2019-04-09"""
If you are allowed to install Selenium Basic, I would use that:
Option Explicit
'download selenium https://github.com/florentbr/SeleniumBasic/releases/tag/v2.0.9.0
'Ensure latest applicable driver e.g. ChromeDriver.exe in Selenium folder
'VBE > Tools > References > Add reference to selenium type library
Public Sub DownloadFile()
    Dim d As WebDriver
    Set d = New ChromeDriver
    Const URL = "url"
    With d
        .Start "Chrome"
        .get URL
        'login steps
        .get "otherUrl" 'placeholder for the specific URL to open in the now authenticated session
        Stop '<delete me later
        .Quit
    End With
End Sub

Access web page body text using VBA & Selenium

I am trying to convert an Excel macro that currently uses Internet Explorer and uses the following line of code to extract the web page's <body> text:
x = .Document.DocumentElement.InnerText
Using the Selenium demo, I am able to produce a jpg of the page with Chrome & IE, but Firefox just loads a blank page, and IE64 & Edge don't work on Windows 10.
I have been unable to find the proper VBA command with Selenium to copy the body text to the variable "x". I only want to read it.
I am trying to do this to make my macro browser independent.
The macro is for my use only.
Jim
You are not making it browser agnostic. You are simply widening the choice of browsers to those supported via Selenium Basic. This brings some problems of its own, which you are noticing.
Folders containing the drivers must be on the environment PATH, or the path must be passed to the Selenium WebDriver as an argument.
You should use the latest Chrome browser and Chrome driver.
You cannot use the latest Firefox browser and driver. It is not supported. I think you need Firefox v46.0.1.
If using IE then zoom must be to 100%.
I suggest browsing the issues pages on GitHub for further known issues.
Anecdotally, I have heard some banter about problems with Windows 10 and Selenium Basic - I would be interested to know if anyone has got this working, as I am not on that version.
Review the examples.xlsm provided on the Selenium Basic GitHub site to see which other browsers are supported (e.g. Opera, PhantomJS, FirefoxLight, CEF).
With Chrome you can get the body text with this:
Option Explicit
Public Sub GetInfo()
    Dim d As WebDriver, s As String
    Set d = New ChromeDriver
    Const URL = "https://www.neutrinoapi.com/api/api-examples/python/"
    With d
        .Start "Chrome"
        .get URL
        s = .FindElementByTag("body").Text
        Debug.Print s
        .Quit
    End With
End Sub
Other info: https://stackoverflow.com/a/52294259/6241235

IE automation through Excel VBA

The problem that I'm having is quite simple. I'm opening a webpage, looking for the input box where I type some text, and then hitting a Search button. Once the new webpage is loaded, I gather all the info I need. My problem is the time spent loading the webpage: my gathering code doesn't work because the new page has not finished loading yet. I have the following code to wait for that:
Do While ie.ReadyState <> READYSTATE_COMPLETE
    DoEvents
Loop
where ie was set like this
Set ie = New InternetExplorer
Is there other code, apart from Application.Wait, that I can use to fix this?
I've run into similar issues when attempting the same. The issue is that the ready state on the IE object can't always be trusted, or at the very least, it's not signaling what you think; for example, it will let you know when each frame is ready, not the whole page. So if you don't actually need to see the web browser control and only care about sending and receiving data, my suggestion is to not bother rendering the page in a web browser object at all; instead, just send and receive the data using a WinHttpRequest.
Tools > References > Microsoft WinHTTP Services, version 5.1
Using this, you can send and receive the HTML data directly. If your page uses URL parameters, you send a "GET" and then parse the reply. Otherwise you will have to send a "POST" with the form data filled in (basically, take the blank form page you begin with and set all the values). When first using this approach, it can be a bit tricky to get the formatting correct, depending on the complexity of the page you are trying to automate. Find a good web debugging tool (such as Fiddler) so that you can see the HTML being sent to your target page.
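For illustration, a minimal GET sketch using WinHttpRequest (the URL and query parameter are placeholders):
'Minimal WinHttpRequest sketch: fetch a page and read the returned HTML
Dim req As Object
Set req = CreateObject("WinHttp.WinHttpRequest.5.1")
req.Open "GET", "https://www.example.com/search?q=test", False 'placeholder URL with a query parameter
req.Send
Debug.Print req.Status 'e.g. 200
Debug.Print req.ResponseText 'the returned HTML, ready to parse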

Resources