The problem I'm having is quite simple. I'm opening a webpage, finding the input box where I type some text, and then hitting a Search button. Once the new webpage is loaded, I gather all the info I need. My problem is the time spent loading the webpage: my gathering code doesn't work because the new webpage is still not loaded. I have the following code to wait for that:
Do While ie.ReadyState <> READYSTATE_COMPLETE
DoEvents
Loop
where ie was set like this
Set ie = New InternetExplorer
Is there any other code, besides Application.Wait, that I can use to fix this?
I've run into similar issues when attempting the same. The problem is that the ready state on the IE object can't always be trusted, or at the very least, it's not signaling what you think: for example, it will let you know when each frame is ready, not the whole page. So if you don't actually need to see the web browser control and you only care about sending and receiving data, my suggestion is to not bother rendering the page in a web browser object at all; instead, just send and receive data using a WinHttpRequest.
Tools > References > Microsoft WinHTTP Services, version 5.1
Using this, you can send and receive the HTML data directly. If your page uses URL parameters, you send a "GET" and then parse the reply. Otherwise you will have to send a "POST" with the form data filled in (basically, take the blank form page you begin with and set all the values). When first using it, it can be a bit tricky to get the formatting correct, depending on the complexity of the page you are trying to automate. Find a good web debugging tool (such as Fiddler) so that you can see the HTTP traffic being sent to your target page.
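For illustration, here is a minimal sketch of a "GET" through WinHttpRequest. The URL and the "q" parameter are placeholders for your target page's actual search endpoint, which a tool like Fiddler will reveal:

Sub FetchSearchResults()
    'Late-bound WinHTTP request; no browser involved
    Dim http As Object
    Set http = CreateObject("WinHttp.WinHttpRequest.5.1")
    http.Open "GET", "https://example.com/search?q=some+text", False 'synchronous
    http.Send
    'Send returns only once the full response has arrived
    Debug.Print http.Status, Left(http.ResponseText, 200)
End Sub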
To prepare for the eventual 'going away' of IE11, I've been trying to figure out how to replace a couple of parts of my code. One involves launching IE and using that browser to scrape some pages. Is there an equivalent way to do the below in Edge? I don't see a way to add a reference to the Edge libraries like I did with 'Microsoft Internet Objects' and IE11.
Dim ie As InternetExplorerMedium: Set ie = New InternetExplorerMedium
Dim html As HTMLDocument

With ie
    .Visible = False
    .Navigate website 'string that's created above this code
End With

'Wait until the browser reports the page is loaded
Do While ie.ReadyState <> READYSTATE_COMPLETE
    DoEvents
Loop
Application.Wait Now + #12:00:10 AM# 'then wait a further 10 seconds

Set html = ie.Document
Thanks everyone for your help.
Ok, a few explanations. I am writing these as a reply so as not to have to split them into several comments.
Does Edge work instead of IE to do web scraping with VBA?
It does not work directly. The reason is that IE has a COM interface (Wikipedia: Component Object Model). No other browser has this interface, not even Edge.
But for Edge there is also a WebDriver for Selenium, provided directly by Microsoft.
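As a rough sketch only: with the third-party SeleniumBasic wrapper installed (a reference to the Selenium Type Library) plus Microsoft's Edge WebDriver, driving Edge from VBA looks roughly like this. The URL and selector are placeholders, and the method names should be verified against your installed version of the wrapper:

Sub ScrapeWithEdge()
    Dim driver As New Selenium.WebDriver
    driver.Start "edge", "https://example.com" 'placeholder base URL
    driver.Get "/"
    Debug.Print driver.FindElementByCss("h1").Text 'placeholder selector
    driver.Quit
End Sub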
Another alternative - xhr
Since you can't use Selenium because you don't have admin rights, there might be the possibility to use xhr (XML HTTP Request). However, in order to make a statement on this, we would have to know the page that you want to scrape.
Xhr can be used directly from VBA because it does not use a browser. The big limitation is that only static content can be processed. No JavaScript is executed, so nothing is reloaded or generated dynamically in any other way. On the other hand, this option is much faster than browser solutions. Often, a static file provided by the web server is sufficient. This can be an HTML file, a JSON or another data exchange format.
There are many examples of using xhr with VBA here on SO. Take note of the possibility first as another approach. I can't explain the method exhaustively here, also because I don't know everything about it myself. But there are many ways to use it.
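For example, a minimal xhr sketch in VBA might look like this. The URL is a placeholder, and this only works if the content you need is present in the static HTML:

Sub XhrScrape()
    Dim xhr As Object, doc As Object
    Set xhr = CreateObject("MSXML2.XMLHTTP.6.0")
    xhr.Open "GET", "https://example.com/page.html", False 'synchronous GET
    xhr.send
    'Parse the static HTML without any browser
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = xhr.responseText
    Debug.Print Left(doc.body.innerText, 200)
End Sub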
By the way
IE will finally be discontinued in June 2022 and will then no longer be delivered with Windows; that's what I read on German IT news sites a few days ago. But there are already massive restrictions on the use of IE.
I'm trying to create a clone of getpocket.com for learning. On that app, every saved link gets converted into markdown, and it seems like it's filtered content with only the page title and body, without headers, footers, etc.
I could get the page's title using the puppeteer API through different means:
using page.title()
or getting the page's Open Graph "og:title"
But how do I get the summarized version containing only the main content of the page?
Note that I don't know beforehand the CSS class of the main content, since I'm planning on just entering a URL in a textbox and scraping that site from there.
I have found what I needed for this scenario.
I used the Readability.js library to make webpages readable by removing certain HTML tags. Here's the library.
This library is what Mozilla uses behind the scenes when rendering their reader view.
I've seen three different ways to check whether the page I'm navigating to is ready, as shown in the sample code below.
It seems to me Method 1 is the best, but I'm hoping an expert out there can tell me otherwise, or even better, provide the right way to do it if there is something different.
Here's the sample code
Sub OpenBrowser()
    Dim vOBJBROWSER As Object
    Set vOBJBROWSER = CreateObject("InternetExplorer.Application")
    vOBJBROWSER.Navigate "http://stackoverflow.com"

    'Method 1
    Do While vOBJBROWSER.Busy Or vOBJBROWSER.ReadyState <> 4
        DoEvents
    Loop

    'Method 2
    Do While vOBJBROWSER.ReadyState < 4
        DoEvents
    Loop

    'Method 3
    Do
    Loop Until vOBJBROWSER.ReadyState = READYSTATE_COMPLETE

    vOBJBROWSER.Visible = True
End Sub
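For comparison, there is also an event-driven fourth option: instead of polling ReadyState, handle the browser's DocumentComplete event. A minimal sketch, assuming a reference to Microsoft Internet Controls; the class name clsIEWatcher is mine:

'--- in a class module named clsIEWatcher ---
Public WithEvents ie As SHDocVw.InternetExplorer
Public Done As Boolean

Private Sub ie_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    'DocumentComplete fires once per frame; only the top-level
    'document passes the browser object itself as pDisp
    If pDisp Is ie Then Done = True
End Sub

'--- in a standard module ---
Sub OpenBrowserWithEvents()
    Dim watcher As New clsIEWatcher
    Set watcher.ie = New SHDocVw.InternetExplorer
    watcher.ie.Navigate "http://stackoverflow.com"
    Do Until watcher.Done
        DoEvents
    Loop
    watcher.ie.Visible = True
End Sub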
The IE browser is going to make you really hate life in the long run.
As with any browser-based solution in web scraping, you only need the browser if you can't figure out what the resource is that you're trying to load.
Consider all the overhead (JavaScript, CSS, potential tracking cookies) that accompanies using a browser.
Now if you know what you want, and can see in Chrome Dev Tools how it loads, you can use VBA's HTTP request libraries and you'll have a much better time.
The advantage of using an HTTP request is that even if the response is streamed or chunked, you can control and easily measure when the message is done. With a web page you'll always be stuck trying to figure out what the status code is, plus sub-frames and all kinds of other cruft.
I highly recommend channeling the frustration of IE automation into a learning experience with HTTP and Chrome Dev Tools. You will 100% be less likely to smash your keyboard.
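To illustrate the point about knowing when the message is done, here is a small sketch using WinHttpRequest's asynchronous mode and its WaitForResponse method; the URL is just an example:

Sub MeasureCompletion()
    Dim req As Object
    Set req = CreateObject("WinHttp.WinHttpRequest.5.1")
    req.Open "GET", "https://stackoverflow.com", True 'async send
    req.Send
    '...other work could happen here...
    req.WaitForResponse 'blocks until the complete response has arrived
    Debug.Print "Done, HTTP status: " & req.Status
End Sub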
When I set the response content type to Excel, the Open/Save dialog is shown twice, but only on IE8. It works fine on other browsers (tested on Chrome/Firefox/Opera).
The code for setting response content type is:
response.setContentType("application/vnd.ms-excel");
response.setHeader("Content-disposition","attachment;filename=abc.xls");
I searched for solutions and workarounds. Turning off SmartScreen didn't help.
Another suggestion was to wait 5-10 seconds before clicking Save/Open. That didn't work either.
What's the cause of this? Are there any IE-specific workarounds?
It's a pain, but IE8 is still widely used by our users.
This is just a guess, but it could have something to do with the way Office (used to) embed itself in IE with plugins.
A workaround might be putting it in a zip file before sending it over to the user.
How do I capture the whole web page when using QTP?
I am aware of the CaptureBitmap method for taking a screenshot, but how do I capture the whole page? Help!
What do you want to capture? If it's the HTML, you can create a checkpoint on the Page test object and check the HTML source checkbox in the HTML verification section.
If you want to capture an image of the page, then you can only capture the visible part with CaptureBitmap; there is no way to get an image of the scrolled-out parts (unless you scroll and use multiple captures).
Use Browser("").Capturebitmap.
This takes the screenshot of the visible browser.
Use the sendkeys method to do a page down, then use Browser("").Capturebitmap again!
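A rough sketch of that capture-scroll-capture idea (the file paths and the descriptive-programming value are placeholders):

Set objShell = CreateObject("WScript.Shell")
'First capture: the part of the page currently in view
Browser("title:=.*").CaptureBitmap "C:\Screens\part1.png", True
'Scroll one screen down, give the page a moment, then capture again
objShell.SendKeys "{PGDN}"
Wait 1
Browser("title:=.*").CaptureBitmap "C:\Screens\part2.png", True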
A full-page screenshot can be taken by toggling QTP's run settings rather than using CaptureBitmap. We can tell QTP to always take screenshots, interact with the page (or object) we wish to capture (e.g. call .Exist(0)), and this will feed a screenshot into the results.
The code to do this:
Dim App 'As Application
Set App = CreateObject("QuickTest.Application")
App.Options.Run.ImageCaptureForTestResults = "Always" 'capture a screenshot on every step
Browser("index:=0").Page("index:=0").Sync 'interact with the page so a capture is taken
App.Options.Run.ImageCaptureForTestResults = "OnError" 'restore the default setting
Technically this seems to capture the HTML and then present it to the user in the run results, rather than an actual image of the browser's rendering of the HTML. But still, this means we can see what's on the page, including the parts that are not visible.
I did a lot of searching but couldn't find the right answer, or couldn't implement what I found due to a restriction on using third-party APIs in my office. By using DotNetFactory, we can use .NET libraries to take screenshots and merge them. Refer to the page below for the complete code:
http://www.testbasket.com/2015/08/capture-whole-web-page-using-uftqtp.html
However, I have pasted the contents from the page here and hope it helps.
In order to take a screenshot of the complete page, I have used DotNetFactory and the System.Drawing .NET library.
Let's go through the solution step by step.
As part of implementing the solution, we need to get the height and width of the entire page. To get that, we access the page's DOM via the .Object method.
'Get the full height of the page
FullHeight = Browser("Wikipedia, the free encycloped").Object.document.body.scrollHeight
'Get the full width of the page
FullWidth = Browser("Wikipedia, the free encycloped").Object.document.body.scrollWidth
Once we have the complete page size, we need to find the client size (how much the browser can show):
'Get the visible height - the viewable part of the page
BrowserHeight = Browser("Wikipedia, the free encycloped").Object.document.body.clientHeight
'Get the visible width - the viewable part of the page
BrowserWidth = Browser("Wikipedia, the free encycloped").Object.document.body.clientWidth
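As a small illustration of the arithmetic this implies (variable names as above), the number of captures needed to cover the page vertically is the full height divided by the visible height, rounded up:

'Number of vertical captures, rounding up for a partial last screen
NumCaptures = Int(FullHeight / BrowserHeight)
If FullHeight Mod BrowserHeight > 0 Then NumCaptures = NumCaptures + 1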
Next, we need to create instances of the required .NET types using DotNetFactory:
Set oGraphics=DotNetFactory.CreateInstance("System.Drawing.Graphics")
Set oPoint=DotNetFactory.CreateInstance("System.Drawing.Point")
Set oImgFormat=DotNetFactory.CreateInstance("System.Drawing.Imaging.ImageFormat","System.Drawing", Nothing)
Set oImageLib = DotNetFactory.CreateInstance("System.Drawing.Image")
Set oPens=DotNetFactory.CreateInstance("System.Drawing.Pens","System.Drawing")
As a final step, we need to loop through the page and take the screenshots separately. Finally, using the .NET library, we merge the images with the Graphics.DrawImage method. It is easy to implement; the complete set of code is available at the above-mentioned link for reference.
If you would like a single screenshot of the whole page, try using SnagIt.
There's a handy PDF with more info on how to go about it (http://download.techsmith.com/snagit/docs/comserver/enu/snagitcom.pdf)
In QTP it might look like this:
Sub Capture_Scroll_Image()
    Set objShell = CreateObject("WScript.Shell")
    Set oSnag = CreateObject("SNAGIT.ImageCapture")
    oSnag.IncludeCursor = False
    oSnag.OutputImageFile.FileType = 5
    oSnag.OutputImageFile.FileNamingMethod = 1
    oSnag.OutputImageFile.Directory = "C:\Screens\"
    oSnag.OutputImageFile.Filename = "Name"
    oSnag.EnablePreviewWindow = False
    oSnag.AutoScrollOptions.AutoScrollMethod = 1
    oSnag.Capture
    Wait (1)
    objShell.SendKeys "{ENTER}"
    'Wait until SnagIt reports the capture is finished
    Do Until oSnag.IsCaptureDone
    Loop
    Set oSnag = Nothing
    Set objShell = Nothing
End Sub