Unable to get rid of blank screen popping up while scraping content - excel

I've created a macro to parse the name of different movies traversing multiple pages from a torrent site. I used InternetExplorer in combination with Html.body.innerHTML parser (usually used with xmlhttp requests) to make the execution faster. Although the content of this site is not dynamic, I used IE to see how it behaves when it does the scraping in combination with Html.body.innerHTML.
When I run my script using IE, it parses the content as expected and finishes the job. The only problem is that a blank screen pops up out of nowhere on every page, as in the image below.
How can I get rid of this blank screen?
My script (no issues when I use Html.body.innerHTML in combination with Chrome Driver):
Sub FetchContent()
Const link = "https://yts.am/browse-movies/0/all/action/0/latest?page="
Dim driver As New ChromeDriver, Html As New HTMLDocument, genre$
Dim post As HTMLDivElement, I&, R&
For I = 1 To 3
With driver
.AddArgument "--headless"
.get link & I
Html.body.innerHTML = .ExecuteScript("return document.documentElement.outerHTML")
End With
For Each post In Html.getElementsByClassName("browse-movie-bottom")
R = R + 1: Sheets(1).Cells(R, 1) = post.getElementsByClassName("browse-movie-title")(0).innerText
Next post
Next I
driver.Quit
End Sub
I would like to get the same behavior when I try using Html.body.innerHTML in combination with IE but it produces blank screens along with the desired result:
Sub FetchContent()
Const link = "https://yts.am/browse-movies/0/all/action/0/latest?page="
Dim IE As New InternetExplorer, Html As New HTMLDocument, genre$
Dim post As HTMLDivElement, I&, R&
For I = 1 To 4
With IE
.Visible = False
.navigate link & I
While .Busy Or .readyState < 4: DoEvents: Wend
Html.body.innerHTML = .document.DocumentElement.outerHTML
End With
For Each post In Html.getElementsByClassName("browse-movie-bottom")
R = R + 1: Sheets(1).Cells(R, 1) = post.getElementsByClassName("browse-movie-title")(0).innerText
Next post
Next I
IE.Quit
End Sub
Blank screen that appears in every pagination in case of IE combination with Html.body.innerHTML:

I tested your code with IE and was able to reproduce the issue.
The line below is what causes it:
Html.body.innerHTML = .document.DocumentElement.outerHTML
Below is a modified version of your code in which the issue is resolved. It reads the elements directly from .document instead of copying the page into a separate HTMLDocument.
Sub FetchContent1()
Const link = "https://yts.am/browse-movies/0/all/action/0/latest?page="
Dim IE As New InternetExplorer, Html As New HTMLDocument, genre$
Dim post As HTMLDivElement, I&, R&
For I = 1 To 4
With IE
.Visible = False
.navigate link & I
While .Busy Or .readyState < 4: DoEvents: Wend
For Each post In .document.getElementsByClassName("browse-movie-bottom")
R = R + 1: Sheets(1).Cells(R, 1) = post.getElementsByClassName("browse-movie-title")(0).innerText
Next post
End With
Next I
IE.Quit
End Sub
Tested in IE 11: it no longer opens a blank page.

Related

Scrape with xmlhttp

I would like to get data from https://www.goaloong.net/football/6in1
This page contains a table.
I tried with:
Sub REQUESTXML()
Dim XMLHttpRequest As xmlHttp
Dim HTMLDoc As New HTMLDocument
Dim elem As Object
Dim x As Long
Set XMLHttpRequest = New MSXML2.xmlHttp
XMLHttpRequest.Open "GET", "https://www.goaloong.net/football/6in1", False
XMLHttpRequest.send
While XMLHttpRequest.readyState = 200
DoEvents
Wend
Debug.Print XMLHttpRequest.responseText
HTMLDoc.Body.innerHTML = XMLHttpRequest.responseText
x = 1
For Each elem In HTMLDoc.getElementsByClassName("Leaguestitle")
Sheets("req").Range("A" & x).Value = HTMLDoc.getElementsByTagName("a")(0).innerText
x = x + 1
Next elem
End Sub
I have no result.
Kindly help me?
The page https://www.goaloong.net/football/6in1 is dynamic, i.e. the JavaScript is loaded first, and the scripts then load the content. An XMLHTTP request only returns the initial HTML, without the script-generated content. One approach is to load the full page in IE and read the content from there. Example below (tested):
Sub REQUESTXML()
Dim IE As New InternetExplorer
Dim elem As Object
Dim x As Long
IE.navigate "https://www.goaloong.net/football/6in1"
Do While IE.readyState = READYSTATE_COMPLETE: DoEvents: Loop
Do Until IE.readyState = READYSTATE_COMPLETE: DoEvents: Loop
'for debug purpose
Open ThisWorkbook.Path & "\TESTFILE.html" For Output As #1
Print #1, IE.document.body.innerHTML
Close #1
x = 1
For Each elem In IE.document.getElementsByClassName("Leaguestitle")
Sheets(1).Range("A" & x).Value = elem.innerText
x = x + 1
Next elem
IE.Quit
End Sub
If you're OK with using a DLL and rewriting your code, you can run Microsoft's Edge browser (a Chromium-based browser) with VBA. With that you can do almost anything you want. Note, however, that access to the DOM is performed by JavaScript, not through an object like Dim IE As New InternetExplorer. Look at the VBA sample and you'll get the idea.
https://github.com/peakpeak-github/libEdge
Sidenote: Samples for C# and C++ are also included.

Webscraping of product prices and specs

I'm new to the VBA world and I was trying to create an Excel file to get the cheapest price and the product name from a website. I've created one file that goes through the whole search list, gathers each product URL and places it on a spreadsheet. My challenge now is making this second macro work. It starts out getting the information without any issues, but after 10 or more URLs the macro misbehaves and starts repeating the same information in all the following cells. Is there a way I can make the code run slower so it doesn't run into this problem?
I'll list my code and a sample of the URLs I'm scraping.
Here's the code I've been using:
Sub test()
Dim URL As String
Set ie = CreateObject("internetexplorer.application")
For i = 2 To 300
URL = Cells(i, 1).Value
ie.navigate URL
ie.Visible = False
Do While ie.busy And ie.readystate <> "readystate_complete"
DoEvents
Loop
Cells(i, 3) = ie.document.getElementsByTagName("h1")(0).innerText
Cells(i, 4) = ie.document.getElementsByTagName("strong")(0).innerText
Next i
ie.Quit
MsgBox "acabou"
End Sub
<https://www.zoom.com.br/notebook/notebook-vaio-vjf157f11x-b0211s-intel-core-i5-8250u-15-6-8gb-ssd-256-gb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-samsung-chromebook-plus-intel-celeron-3965y-12-2-4gb-emmc-32-gb-chrome-os-touchscreen>
<https://www.zoom.com.br/notebook/notebook-dell-xps-7390-intel-core-i7-10710u-13-3-16gb-ssd-512-gb-windows-10-touchscreen>
<https://www.zoom.com.br/notebook/notebook-dell-i15-3583-a5-intel-core-i7-8565u-15-6-8gb-hd-2-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i5-8250u-15-6-4gb-hd-1-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-dell-i15-7580-a20-intel-core-i7-8550u-15-6-8gb-hd-1-tb-geforce-mx150-windows-10>
<https://www.zoom.com.br/notebook/notebook-dell-i14-3480-u30-intel-core-i5-8265u-14-4gb-hd-1-tb-linux-8-geracao>
<https://www.zoom.com.br/notebook/macbook-pro-apple-muhn2bz-intel-core-i5-13-3-8gb-ssd-128-gb-tela-de-retina>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc150-amd-a4-9120-14-2gb-emmc-32-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-samsung-np930qaa-kw1br-intel-core-i7-8550u-13-3-8gb-ssd-256-gb-windows-10-touchscreen>
<https://www.zoom.com.br/notebook/notebook-acer-a515-51g-58vh-intel-core-i5-7200u-15-6-8gb-hd-1-tb-geforce-940mx>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc222-intel-celeron-dual-core-13-3-4gb-emmc-64-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-pt515-51-788a-intel-core-i7-9750h-15-6-32gb-ssd-1-tb-geforce-rtx-2070-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-a315-53-53ak-intel-core-i5-7200u-15-6-4gb-hd-1-tb-windows-10-7-geracao>
<https://www.zoom.com.br/notebook/notebook-dell-i15-5584-m40-intel-core-i7-8565u-15-6-8gb-hd-2-tb-geforce-mx130-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-a315-41g-r21b-amd-ryzen-5-2500u-15-6-8gb-hd-1-tb-radeon-535-windows-10>
<https://www.zoom.com.br/notebook/notebook-positivo-master-n2140-intel-core-i3-7020u-14-4gb-hd-500-gb-windows-10-7-geracao>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc101-intel-atom-14-1gb-ssd-32-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i5-8250u-15-6-8gb-hd-1-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-acer-an515-51-77fh-intel-core-i7-7700hq-15-6-8gb-hd-1-tb-geforce-gtx-1050-windows-10>
<https://www.zoom.com.br/notebook/notebook-dell-i15-3583-a2yp-intel-core-i5-8265u-15-6-4gb-optane-16-gb-hd-1-tb-windows-10>
<https://www.zoom.com.br/notebook/notebook-asus-g531gt-intel-core-i7-9750h-15-6-16gb-ssd-512-gb-geforce-gtx-1650-windows-10>
<https://www.zoom.com.br/notebook/notebook-vaio-fit-15s-intel-core-i3-7100u-15-6-4gb-hd-1-tb-windows-10-home>
<https://www.zoom.com.br/notebook/notebook-samsung-s50-intel-core-i7-7500u-13-3-8gb-ssd-256-gb-windows-10-style>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i3-7020u-15-6-4gb-ssd-120-gb-windows-10-7-geracao>
First of all:
Always declare all variables. To force this, always write Option Explicit as the first line in each module. This way, especially typos in variable names are immediately detected by the compiler.
Here is what to do about your problem:
IE is sometimes a real diva. For example, it doesn't like to have to process URLs in the same instance in quick succession. Therefore it is advisable to kick it out of memory and restart it for each new URL.
To restart it quickly, IE must not be configured to delete cookies, the cache, etc. on exit. Otherwise automation errors will occur.
Try this macro. With the given URLs it works:
Option Explicit
Sub test()
Dim URL As String
Dim ie As Object
Dim i As Long
For i = 2 To 300
If i > 14 Then
ActiveWindow.SmallScroll down:=1
End If
URL = ActiveSheet.Cells(i, 1).Value
Set ie = CreateObject("internetexplorer.application")
ie.navigate URL
ie.Visible = False
Do While ie.readystate <> 4: DoEvents: Loop
ActiveSheet.Cells(i, 3) = ie.document.getElementsByTagName("h1")(0).innerText
ActiveSheet.Cells(i, 4) = ie.document.getElementsByTagName("strong")(0).innerText
ie.Quit
Set ie = Nothing
Next i
MsgBox "acabou"
End Sub
Try this
Sub GetPrices()
Dim html As MSHTML.HTMLDocument, r As Long
For r = 1 To 4
Set html = GetHTML(Cells(r, 1).Value)
Cells(r, 3).Value = html.querySelector("h1.product-name").innerText
Cells(r, 4).Value = Replace(Replace(html.querySelector(".product-price").innerText, "a partir de ", vbNullString), ":( ", "")
Set html = Nothing
Next r
End Sub
Function GetHTML(ByVal sURL As String) As HTMLDocument
Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
Set http = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With http
.Open "Get", sURL, False
.send
html.body.innerHTML = .responseText
End With
Set GetHTML = html
End Function
In Python I would do the following.
First import the libraries:
from time import sleep
from random import randint
Two options are available.
This call pauses for 3 seconds before each request:
sleep(3)
This call pauses for a random 1 to 3 seconds between requests:
sleep(randint(1, 3))
Note: also be aware of:
cookies, because sometimes you need to send a POST request to obtain an ID before the GET request.
the syntax of your headers; set the Origin and Referer parameters correctly.
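The same kind of pause can be done without leaving VBA. Below is a minimal sketch, assuming the macro runs in Excel (Application.Wait is Excel-specific). RandomDelaySeconds and PauseSeconds are hypothetical helper names, not part of the answers above:

```vba
Option Explicit

' Hypothetical helper: returns a whole number of seconds
' between minSeconds and maxSeconds, both inclusive.
Public Function RandomDelaySeconds(ByVal minSeconds As Long, ByVal maxSeconds As Long) As Long
    ' Rnd returns a value in [0, 1), so Int(...) yields 0 .. (max - min)
    RandomDelaySeconds = Int((maxSeconds - minSeconds + 1) * Rnd) + minSeconds
End Function

' Hypothetical helper: blocks the macro for the given number of whole seconds.
Public Sub PauseSeconds(ByVal seconds As Long)
    Application.Wait DateAdd("s", seconds, Now)
End Sub
```

Calling Randomize once at the start of the macro and then PauseSeconds RandomDelaySeconds(1, 3) before each ie.navigate would space the requests 1 to 3 seconds apart. Whether that is enough to stop the repeated values depends on the site, so treat the interval as something to tune.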

When the search button is clicked using vba the text entered in search box is not seen by web page

I have written VBA code that enters a manufacturer part number in the search box of the website below and clicks the search icon. It is able to enter the part number and click the icon, but when the search icon is clicked, the text entered in the text box is not picked up. It searches with empty data.
'HTML Part for search icon
<em class="fa fa-search" aria-hidden="true" style="color: gray;"></em>
For almost a month I have tried various approaches, including ones mentioned on Stack Overflow such as using createEvent("keyboardevent"), but nothing worked.
' VBA code
Sub AptivScrapping()
Dim IE As SHDocVw.InternetExplorer
Set IE = New InternetExplorer
IE.Visible = True
IE.navigate "https://ecat.aptiv.com"
Do While IE.readyState < READYSTATE_COMPLETE
Loop
Dim idoc As MSHTML.HTMLDocument
Set idoc = IE.document
idoc.getElementById("searchUserInput").Value = "33188785"
Dim doc_ele As MSHTML.IHTMLElement
Dim doc_eles As MSHTML.IHTMLElementCollection
Set doc_eles = idoc.getElementsByTagName("a")
For Each doc_ele In doc_eles
If doc_ele.getAttribute("ng-click") = "SearchButtonClick(1)" Then
doc_ele.Click
Exit Sub
Else
End If
Next doc_ele
End Sub
The page does an xhr request to retrieve the search results. You can find it in the network tab after clicking submit. This means you can avoid, in this case, the expense of a browser and issue an xhr request. The response is json so you do need a json parser to handle the results.
I would use jsonconverter.bas to parse the json. After installing the code from that link in a standard module called JsonConverter, go to VBE > Tools > References > Add a reference to Microsoft Scripting Runtime
I dimension an array to hold the results. I determine rows from the number of items in the json collection returned and the number of columns from the size of the first item dictionary. I loop the json object, and inner loop the dictionary keys of each dictionary in collection, and populate the array. I write the array out in one go at end which is less i/o expensive.
Option Explicit
Public Sub GetInfo()
Dim json As Object, ws As Worksheet, headers()
Dim item As Object, key As Variant, results(), r As Long, c As Long
Set ws = ThisWorkbook.Worksheets("Sheet1")
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://ecat.aptiv.com/json/eCatalogSearch/SearchProducts?filter=All&options=&pageSize=10&search=33188785", False
.send
Set json = JsonConverter.ParseJson(.responseText)("Products")
End With
headers = json.item(1).keys
ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
For Each item In json
r = r + 1: c = 1
For Each key In item.keys
results(r, c) = item(key)
c = c + 1
Next
Next
With ws
.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End With
End Sub
You can do this instead:
txt = "33188785"
IE.navigate "https://ecat.aptiv.com/feature?search=" & txt
This will take you straight to the Search Result.
Code:
Sub AptivScrapping()
Dim IE As SHDocVw.InternetExplorer
Dim txt As String
Set IE = New InternetExplorer
txt = "33188785"
IE.Visible = True
IE.navigate "https://ecat.aptiv.com/feature?search=" & txt
Do While IE.Busy
Application.Wait DateAdd("s", 1, Now)
Loop
End Sub
This will be faster, as you only have to load one page.
Why that's happening I am not sure, but it seems that the text box used for input is not activated when text is added to it programmatically. It is activated when we click inside it.
I got the solution for the above problem from MrExcel.com; below is the link to that post.
https://www.mrexcel.com/forum/excel-questions/1105434-vba-ie-automation-issue-angularjs-input-text-post5317832.html#post5317832
In this case I need to enter the search string character by character, so that the SendKeys and input events fire inside the loop. Below is the working VBA code.
Sub AptivScrapping()
Dim IE As SHDocVw.InternetExplorer
Set IE = New InternetExplorer
IE.Visible = True
IE.navigate "https://ecat.aptiv.com"
Do While IE.readyState < READYSTATE_COMPLETE
Loop
Dim idoc As MSHTML.HTMLDocument
Set idoc = IE.document
IE.document.getElementById("searchUserInput").Focus = True
IE.document.getElementById("searchUserInput").Select
sFieldInput = "33188785"
For s = 1 To Len(sFieldInput)
Application.SendKeys Mid(sFieldInput, s, 1)
While IE.readyState < 4 Or IE.Busy
Application.Wait DateAdd("s", LoopSeconds, Now)
Wend
Next s
IE.document.getElementById("searchUserInput").Focus = False
Dim doc_ele As MSHTML.IHTMLElement
Dim doc_eles As MSHTML.IHTMLElementCollection
Set doc_eles = idoc.getElementsByTagName("a")
For Each doc_ele In doc_eles
If doc_ele.getAttribute("ng-click") = "SearchButtonClick(1)" Then
doc_ele.Click
Exit Sub
Else
End If
Next doc_ele
End Sub

.ReadyState & .Busy Is Not Waiting For IE Page To Load

I have some VBA to launch a company intranet site which will bring me directly to the document I am searching for. I need to wait for the page to finish loading, and then hit the "Print" button, which will open the document in an Adobe Reader-supported IE tab, and from there I save it as a PDF to a drive.
My issue is that the loop I have to wait until the webpage is loaded does not properly wait. Doing some research on SO, I've seen this is a known issue with newer versions of IE. I have since tried playing with some XMLHTTP methods, but I am unfamiliar with those, and my attempts with it have also fallen short (not sure how I would navigate to the next page by hitting the Print link using XMLHTTP).
My current VBA is as follows, and ieApp is New InternetExplorerMedium.
Set objShell = CreateObject("Shell.Application")
IE_Count = 0
IE_Count = objShell.Windows.Count
For x = 0 To (IE_Count - 1)
On Error Resume Next
my_url = ""
my_title = ""
my_url = objShell.Windows(x).Document.Location
my_title = objShell.Windows(x).Document.Title
If my_url Like "http://ctdayppv002/Home/DocViewer?" & "*" Then
Set ie = objShell.Windows(x)
Do While ieApp.ReadyState <> 4 And ie.Busy
DoEvents
Loop
For Each ee In ie.Document.getElementsByTagName("a")
If ee.ID = "printDocLink" Then
ee.Click: DoEvents: Sleep 1500
Do While ie.ReadyState <> 4 And ie.Busy
DoEvents
Loop
Exit For
End If
Next ee
Exit For
Else
End If
Next
If I add a bunch of Sleep time, then it will wait, until a document comes up that exceeds the time I told it to Sleep, so obviously that isn't a reliable solution.
Using the following questions for reference, I have tried to use XMLHTTP, but also noticed comments that this method may not work with JavaScript sites.
VBA hanging on ie.busy and readystate check
web scraping with vba using XMLHTTP
One of my attempts with XMLHTTP:
Public ieApp As MSXML2.XMLHTTP60
Set ieApp = New MSXML2.XMLHTTP60
With ieApp
.Open "GET", urlString, False
.send
While ieApp.ReadyState <> 4
DoEvents
Wend
Dim HTMLDoc As MSHTML.HTMLDocument
Dim HTMLBody As MSHTML.HTMLBody
Set HTMLDoc = New MSHTML.HTMLDocument
Set HTMLBody = HTMLDoc.body
HTMLBody.innerHTML = ieApp.responseText
Debug.Print HTMLBody.innerHTML
End With
Within the resulting HTMLBody.innerHTML I do not see the "printDocLink" element.
FYI - I have been emailing a rep from the company that created the website database, and they do not believe there is an API call that can directly export as a PDF, which I was hoping would be available to skip over the "Print" button entirely.
Following the advice from Tim Williams and QHarr, I found a solution that works for me.
I added a Do Until, and also a timer for 6 seconds:
t = Now + TimeValue("0:00:6")
Do Until .Document.getElementById("printDocLink") <> 0
DoEvents: Sleep 1000
If Now > t Then
Call Not_Found_PPV(N, searchitem)
.Quit
Set ieApp = Nothing
GoTo NxtInv
End If
Loop

Why does my code to Scrape Text using VBA works in Debug only

I have written some code to scrape specific dates from Google's patent website. After reviewing lots of examples I figured out the getElementsByClassName that gets the date I need. The code below works when I step through in debug mode and generates the desired MsgBox. But when I run it, it gives me "Run-time error '91': Object variable or With block variable not set."
I have added delays wherever I thought that might be an issue. I have also disassociated the code from any interaction with the Excel spreadsheet where I would ultimately put the date, just to make it as simple as possible. I've also copied the code from the original spreadsheet to a new blank one, but same issue.
Any help would be appreciated.
Sub Get_Date()
Dim ie As InternetExplorer
Dim sURL As String
Dim strGrant As Variant
Set ie = New InternetExplorer
sURL = "https://patents.google.com/patent/US6816842B1/en?oq=6816842"
ie.navigate sURL
ie.Visible = False
Do While ie.Busy Or ie.ReadyState < 4
DoEvents
Loop
strGrant = ie.document.getElementsByClassName("granted style-scope application-timeline")(0).innerText
Do While ie.Busy Or ie.ReadyState < 4
DoEvents
Loop
MsgBox strGrant
ie.Quit
End Sub
It's likely a timing issue as per my comment. That's dealt with in other answers to similar questions. Main things to consider are:
Use proper page load waits: While IE.Busy Or ie.readyState < 4: DoEvents: Wend
Possibly a timed loop to attempt to set the element to a variable then testing if set.
Alternatively, a bit of a punt but it seems that all granted dates are the same as publication dates (patent publication date). If this is true then you can use xhr to get the publication date
Option Explicit
Public Sub GetDates()
Dim html As HTMLDocument, i As Long, patents()
patents = Array("US7724240", "US6876312", "US8259073", "US7523862", "US6816842B1")
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
For i = LBound(patents) To UBound(patents)
.Open "GET", "https://patents.google.com/patent/" & patents(i) & "/en?oq=" & patents(i), False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
If html.querySelectorAll("[itemprop=publicationDate]").length > 0 Then
Debug.Print html.querySelector("[itemprop=publicationDate]").DateTime
End If
Next
End With
End Sub
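The timed loop mentioned above (attempt to set the element to a variable, then test whether it was set) could look like the following. This is a minimal sketch, assuming doc is the IE document object accessed with late binding; WaitForFirstByClass is a hypothetical helper name and the 10-second default timeout is arbitrary:

```vba
Option Explicit

' Hypothetical helper: polls doc until at least one element with the given
' class name exists, or until timeoutSeconds have elapsed.
' Returns the first matching element, or Nothing on timeout.
Public Function WaitForFirstByClass(ByVal doc As Object, ByVal className As String, _
                                    Optional ByVal timeoutSeconds As Long = 10) As Object
    Dim deadline As Date
    Dim matches As Object

    deadline = DateAdd("s", timeoutSeconds, Now)
    Do
        Set matches = doc.getElementsByClassName(className)
        If Not matches Is Nothing Then
            If matches.Length > 0 Then
                Set WaitForFirstByClass = matches.Item(0)
                Exit Function
            End If
        End If
        DoEvents ' keep Excel responsive while the page finishes rendering
    Loop While Now < deadline
    ' On timeout the function result is left as Nothing
End Function
```

In the Get_Date macro from the question, this would replace the direct (0).innerText access: set an Object variable to WaitForFirstByClass(ie.document, "granted style-scope application-timeline"), check it with Is Nothing, and only then read its innerText. That avoids run-time error 91 when the element has not been rendered yet, although it cannot help if the element never appears.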

Resources