Pulling text from a website into Excel using VBA

I am exploring whether I can use VBA to write a macro that searches a website for each keyword/code in a list in column A and extracts the data. Currently, the code below searches the desired website using only the value in Range("A1"), but it does reach the right page with the data I wish to extract. In this case the code in A1 is 100-52-7.
Sub BrowseToSite()
    Dim IE As New SHDocVw.InternetExplorer
    Dim HTMLDoc As MSHTML.HTMLDocument
    IE.Visible = True
    IE.Navigate "https://apps.who.int/food-additives-contaminants-jecfa-database/Search.aspx"
    Do While IE.ReadyState <> READYSTATE_COMPLETE
    Loop
    IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$txtSearch").Value = Range("a1").Value
    IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click
    Set HTMLDoc = IE.Document
    'Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText
End Sub
Now I wish to pull the "0-5 mg/kg bw (1996)" phrase on this page into Excel. I planned to do this by retrieving the inner text of the element with that class name; however, I run into the error "Object variable or With block variable not set" on the following line:
Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText

You can get rid of IE altogether and use XMLHTTP requests to make the script more robust. The following script first sends a GET request to scrape the values of the hidden form parameters that the POST request needs, then issues a POST request and parses the desired content from the response.
This is one efficient way to do it:
Option Explicit
Public Sub GetContent()
    Const Url = "https://apps.who.int/food-additives-contaminants-jecfa-database/Search.aspx"
    Dim oHttp As Object, oHtml As HTMLDocument, MyDict As Object
    Dim DictKey As Variant, payload$, searchKeyword$
    Set oHtml = New HTMLDocument
    Set oHttp = CreateObject("MSXML2.XMLHTTP")
    Set MyDict = CreateObject("Scripting.Dictionary")
    'send a GET request first to parse the values of "__VIEWSTATE", "__VIEWSTATEGENERATOR" etc., as in oHtml.getElementById("__VIEWSTATE").Value
    With oHttp
        .Open "GET", Url, False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
        .send
        oHtml.body.innerHTML = .responseText
    End With
    searchKeyword = "100-52-7" 'the search keyword to use, taken from your predefined search terms
    'MyDict stores the form fields as keys and values, as in __VIEWSTATE = "some value" and so on
    MyDict("__VIEWSTATE") = oHtml.getElementById("__VIEWSTATE").Value
    MyDict("__VIEWSTATEGENERATOR") = oHtml.getElementById("__VIEWSTATEGENERATOR").Value
    MyDict("__EVENTVALIDATION") = oHtml.getElementById("__EVENTVALIDATION").Value
    MyDict("ctl00$ContentPlaceHolder1$txtSearch") = searchKeyword
    MyDict("ctl00$ContentPlaceHolder1$btnSearch") = "Search"
    MyDict("ctl00$ContentPlaceHolder1$txtSearchFEMA") = ""
    'join each key=value pair with an ampersand to build the string sent as the body of the POST request
    payload = ""
    For Each DictKey In MyDict
        'the first pair starts the string; every later pair is appended after "&"
        payload = IIf(Len(payload) = 0, WorksheetFunction.EncodeURL(DictKey) & "=" & WorksheetFunction.EncodeURL(MyDict(DictKey)), _
            payload & "&" & WorksheetFunction.EncodeURL(DictKey) & "=" & WorksheetFunction.EncodeURL(MyDict(DictKey)))
    Next DictKey
    With oHttp
        .Open "POST", Url, False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
        .setRequestHeader "Content-type", "application/x-www-form-urlencoded"
        .send (payload)
        oHtml.body.innerHTML = .responseText
    End With
    MsgBox oHtml.querySelector("#SearchResultItem > a").NextSibling.NodeValue
End Sub
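Add the following references (VBE > Tools > References) before running the script. Only Microsoft HTML Object Library is strictly required, because the XMLHTTP and Dictionary objects are created late-bound with CreateObject: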
Microsoft XML, v6.0
Microsoft Scripting Runtime
Microsoft HTML Object Library
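As a minimal sketch of the payload step above, the key/value joining can be pulled into a small helper. The name BuildFormBody and the field names in the demo are hypothetical (they are not part of the original answer); the sketch assumes a Scripting.Dictionary and Excel 2013 or later, where WorksheetFunction.EncodeURL is available.

```vba
Option Explicit

' Hypothetical helper (not from the original answer): turns a
' Scripting.Dictionary into an application/x-www-form-urlencoded body.
Public Function BuildFormBody(ByVal fields As Object) As String
    Dim key As Variant, body As String
    For Each key In fields.Keys
        ' Separate pairs with "&", but do not start the body with one
        If Len(body) > 0 Then body = body & "&"
        body = body & WorksheetFunction.EncodeURL(CStr(key)) & "=" & _
               WorksheetFunction.EncodeURL(CStr(fields(key)))
    Next key
    BuildFormBody = body
End Function

' Usage sketch with made-up field names
Public Sub DemoBuildFormBody()
    Dim d As Object
    Set d = CreateObject("Scripting.Dictionary")
    d("q") = "100-52-7"
    d("btn") = "Search"
    Debug.Print BuildFormBody(d)
End Sub
```

To process a whole list, the same helper could be called once per keyword read from column A, with a fresh GET before each POST so the hidden field values stay current.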

You click on an element with this line of code:
IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click
which makes IE send a POST request to retrieve your results. You can see this request in the Network tab of the browser's dev tools (I checked in Edge, but the concept is the same).
While this request is in flight, the element in question is not there yet, so you need to wait for it to load.
Your prior method of
Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop
would probably work, but I find it inconsistent at times, so I would also check the .Busy property.
Try using this after your click:
IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click
'~~WAIT FOR SEARCH RESULTS TO LOAD~~
Do While IE.ReadyState < READYSTATE_COMPLETE Or IE.Busy
Loop
Set HTMLDoc = IE.Document
Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText
If you're still having issues, you can force your code to wait for the element in question to become available by doing this:
Dim elem As Object
On Error Resume Next
Do
    'query the live document on every pass; elem stays Nothing until the element exists
    Set elem = Nothing
    Set elem = IE.Document.getElementsByClassName("sectionHead1")(0)
    DoEvents
Loop While elem Is Nothing
On Error GoTo 0
Set HTMLDoc = IE.Document
Sheet1.Range("B1").Value = elem.innerText
This is a simple loop that checks for the object, and will continue to loop until that object is no longer Nothing (which essentially means it has loaded).
And I would recommend that you add some sort of timeout that may trigger an error or something just in case the webpage is having issues so you're not in an infinite loop.
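A minimal sketch of such a timeout, assuming the element is looked up by class name as above. The function name WaitForClass, the 10-second default and the WriteSectionHead demo are assumptions made for illustration, not part of the original answer.

```vba
Option Explicit

' Hypothetical helper: polls for the first element with the given class
' and gives up after timeoutSeconds. Returns Nothing on timeout.
Public Function WaitForClass(ByVal IE As Object, ByVal className As String, _
                             Optional ByVal timeoutSeconds As Long = 10) As Object
    Dim deadline As Date, elem As Object
    deadline = DateAdd("s", timeoutSeconds, Now)   ' Now does not reset at midnight like Timer
    Do
        Set elem = Nothing
        On Error Resume Next                       ' the document may not be ready yet
        Set elem = IE.Document.getElementsByClassName(className)(0)
        On Error GoTo 0
        If Not elem Is Nothing Then Exit Do
        DoEvents                                   ' keep Excel responsive while polling
    Loop While Now < deadline
    Set WaitForClass = elem
End Function

' Usage sketch: call after the search button has been clicked
Public Sub WriteSectionHead(ByVal IE As Object)
    Dim elem As Object
    Set elem = WaitForClass(IE, "sectionHead1", 15)
    If elem Is Nothing Then
        MsgBox "Timed out waiting for the search results."
    Else
        Sheet1.Range("B1").Value = elem.innerText
    End If
End Sub
```

Returning Nothing instead of raising an error leaves the caller free to decide whether a timeout should stop the macro or just skip the current keyword.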
Pro Tip:
If you are clicking the search button many times and waiting for many objects to load, instead of duplicating the above code you can turn it into its own sub. Pass the class name rather than the element itself, so the sub can query the document again on every pass:
Sub WaitForElement(IE As InternetExplorer, className As String)
    Dim elem As Object
    Do While IE.ReadyState < 4 Or IE.Busy: DoEvents: Loop
    On Error Resume Next
    Do
        Set elem = IE.Document.getElementsByClassName(className)(0)
        DoEvents
    Loop While elem Is Nothing
    On Error GoTo 0
End Sub
Then you just need the following line after each click:
WaitForElement IE, "sectionHead1"
Not only would this cut down on the number of lines in your code, it could greatly improve readability as well.

Related

Trying to web-scrape an HTML page using the XMLHTTPRequest method from within Excel VBA, but getting run-time error 438

Trying to mimic exactly what's described here:
WebScraping only specific sections of a webpage in VBA
However, my code gets a valid XMLHTTPRequest object back (Status = 200) with a fine .responseText (containing HTML), but then it breaks on this assignment (see the code near the bottom):
HTMLDoc.body.innerHTML = XMLReq.responseText
with the dreaded
Run-time error '438' Object doesn't support this property or method
from within Excel 2019 (i.e. 365) VBA.
I have tried to tweak it, but cannot get past this issue. I want to traverse some of the HTML element types once it works.
The only forum answers to this problem deal with IE, but I am beyond Internet Explorer ;-)
Any clues, anyone?
Sub GetMetaData()
' Original idea: https://stackoverflow.com/questions/37763179/how-to-get-meta-keywords-content-with-vba-from-source-code-in-an-excel-file
' Adapted by PR to work in general i.e. not being Browser specific in any way
' 2022-09-29
Dim webrequest As New MSXML2.XMLHTTP60
Dim responses As Object
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim HTMLElement As HTMLHtmlElement
Dim url As String: url = "https://access.redhat.com/support/policy/updates/errata"
Dim wk As Worksheet
Const META_TAG As String = "META"
Const META_NAME As String = "keywords"
Dim Doc As Object ' As work area, holding HTML
Set Doc = CreateObject("htmlfile") ' to fake ...
Dim metaElements As Object
Dim element As Object
Dim kwd As String
Dim err As Integer
Dim myarray As Variant: ReDim myarray(0 To 20, 0 To 5000)
Set wk = Worksheets(6) ' Hopefully a free worksheet
' Find the HTML document near the URL
With webrequest
.Open "GET", url, True ' Have to use Asynchronously (True), or I get error setting ResponseType !
.setRequestHeader "Content-Type", "text/html, */*" ' Replaced header, to be more specific
'.responseType = "document" ' We cannot operate on any other content than HTML. Tryin, gives Runtime error 438
' Added below header, to ensure its compatible.
.setRequestHeader "user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
.send
Debug.Print "Request Status:", .Status, .statusText ' To see result of XMLHTTP60 request
HTMLDoc.Body.innnerHTML = .responseText ' CODE line giving error ' Pick up the HTML
Set info = HTMLDoc.getElementsByTagName(META_TAG)
Debug.Print "We now have Document META collection (nodelist): ", info
On Error Resume Next ' getting an error, just ignoring for now
End with
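No answer is quoted for this question, but two details in the listing above are worth checking. The failing line spells the property innnerHTML (three n's), whereas the excerpt earlier in the question uses innerHTML; and the request is opened asynchronously (True), so .Status and .responseText may be read before the response has arrived. The sketch below is an assumption about what a working version could look like, not a confirmed fix. It uses a synchronous request, the correctly spelled property, and the same URL and references as the question (Microsoft XML, v6.0 and Microsoft HTML Object Library). The Sub name ListMetaTags is hypothetical.

```vba
Option Explicit

' Hypothetical rewrite of the request/parse step from the question.
' Requires references to Microsoft XML, v6.0 and Microsoft HTML Object Library.
Public Sub ListMetaTags()
    Const URL As String = "https://access.redhat.com/support/policy/updates/errata"
    Dim webrequest As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    Dim metaElements As Object, element As Object

    With webrequest
        .Open "GET", URL, False                  ' synchronous: .send returns once the response is complete
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        Debug.Print "Request Status:", .Status, .statusText
        If .Status <> 200 Then Exit Sub
        HTMLDoc.body.innerHTML = .responseText   ' note the spelling: innerHTML
    End With

    ' The count may be zero if the parser drops head-only tags from the body
    Set metaElements = HTMLDoc.getElementsByTagName("meta")
    Debug.Print "META elements found:", metaElements.Length
    For Each element In metaElements
        Debug.Print element.getAttribute("name"), element.getAttribute("content")
    Next element
End Sub
```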

Can MSXML2.XMLHTTP be used with Chrome

I have been using the following Excel VBA macro to bring back data from a website. It worked fine until a few days ago when the website stopped supporting IE. Of course the macro just fails now as there is no data on the webpage to bring back to Excel, just a message saying, "Your browser, Internet Explorer, is no longer supported." Is there a way to have the "Get method" (MSXML2.XMLHTTP) use Chrome instead of IE to interact with the website? BTW, my default browser is already set to "Chrome".
Dim html_doc As HTMLDocument ' note: reference to Microsoft HTML Object Library must be set
Sub KS()
' Define product url
KS_url = "https://www.kingsoopers.com/p/r-w-knudsen-just-blueberry-juice/0007468210784"
' Collect data
Set html_doc = New HTMLDocument
Set xml_obj = CreateObject("MSXML2.XMLHTTP")
xml_obj.Open "GET", KS_url, False
xml_obj.send
html_doc.body.innerHTML = xml_obj.responseText
Set xml_obj = Nothing
KS_product = html_doc.getElementsByClassName("ProductDetails-header")(0).innerText
KS_price = "$" & html_doc.getElementsByClassName("kds-Price kds-Price--alternate mb-8")(1).Value
do Stuff
End Sub
The check for this is a basic server check on user agent. Tell it what it wants to "hear" by passing a supported browser in the UA header...(or technically, in this case, just saying the equivalent of: "Hi, I am not Internet Explorer".)
It can be as simple as xml.setRequestHeader "User-Agent", "Chrome". I said basic because you could even pass xml.setRequestHeader "User-Agent", "I am a unicorn", so it is likely an exclusion based list on the server for Internet Explorer.
Option Explicit
Public Sub KS()
    Dim url As String
    url = "https://www.kingsoopers.com/p/r-w-knudsen-just-blueberry-juice/0007468210784"
    Dim html As MSHTML.HTMLDocument, xml As Object
    Set html = New MSHTML.HTMLDocument
    Set xml = CreateObject("MSXML2.XMLHTTP")
    xml.Open "GET", url, False
    xml.setRequestHeader "User-Agent", "Mozilla/5.0"
    xml.send
    html.body.innerHTML = xml.responseText
    Debug.Print html.getElementsByClassName("ProductDetails-header")(0).innerText
    Debug.Print "$" & html.getElementsByClassName("kds-Price kds-Price--alternate mb-8")(1).Value
    Stop
End Sub
Compare that with adding no UA or adding xml.setRequestHeader "User-Agent", "MSIE".
Study the article here by Daniel Pineault and this paragraph:
Feature Browser Emulation
Also note my comment dated 2020-09-13.

VBA WebScraping Data Showing Backwards in cell

I am extracting some data from the web. Everything is coming out fine except for one set of data, which shows backwards in the cell when it is extracted.
I cannot work out why it is showing backwards, as everything else extracts fine.
Q) Could someone please advise why it would do this?
This is what I am using to pull the data. It works fine for everything else; only this class shows backwards in Excel.
Set doc = NewHTMLDocument(CStr(link))
'''IF Statement, change class to suite needs 'bscd
' On Error Resume Next
If doc.getElementsByClassName("bscd")(0) Is Nothing Then
wsSheet.Cells(StartRow + Counter, 5).Value = "-"
Else
' On Error Resume Next
wsSheet.Cells(StartRow + Counter, 5).Value = doc.getElementsByClassName("bscd")(0).Children(1).InnerText
End If
[Image: the class in the page HTML]
[Image: the result showing backwards in Excel]
Could it be that the "Complete Information" is a JAVA dropdown?
I just did Ctrl+U as recommended and this is how the HTML looks: it is backwards in the source as well, but displays correctly on the site.
You need to click on that link to access the content. This is one way to do it, using the ChromeDriver class from the Selenium VBA wrapper. I used an explicit wait instead of a hardcoded delay, so the script waits up to 10 seconds for the content to become visible.
Public driver As ChromeDriver
Sub ScrapeContent()
    Const URL$ = "https://www.ebay.co.uk/itm/Metal-Floor-Fan-High-velocity-chrome-free-stand-fan-industrial-fan-3-8-Speed-UK/333664038024"
    Dim oElem As Object, oItem As Object
    Set driver = New ChromeDriver
    driver.get URL
    driver.FindElementByXPath("//span/a[contains(.,'Complete information')]", Timeout:=10000).Click
    Set oElem = driver.FindElementByXPath("//span[contains(.,'Phone:')]/following::span", Timeout:=10000)
    Set oItem = driver.FindElementByXPath("//span[contains(.,'Email:')]/following::span", Timeout:=10000)
    Debug.Print oElem.Text, oItem.Text
End Sub
Output:
13025438495 eshijiali#outlook.com
If you use XMLHTTP requests instead, the text comes back reversed, so I use a function to put it back in the right order:
Function reverseString(inputStr As String)
    Dim myString$, I&
    For I = Len(inputStr) To 1 Step -1
        myString = myString & Mid(inputStr, I, 1)
    Next I
    reverseString = myString
End Function
Sub FetchData()
    Const Url$ = "https://www.ebay.co.uk/itm/Metal-Floor-Fan-High-velocity-chrome-free-stand-fan-industrial-fan-3-8-Speed-UK/333664038024"
    Dim HTML As New HTMLDocument, oPost As Object
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", Url, False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
        .send
        HTML.body.innerHTML = .responseText
    End With
    Set oPost = HTML.getElementsByClassName("bsi-lbl")
    If Not oPost Is Nothing And oPost.Length >= 1 Then
        Debug.Print reverseString(oPost(0).NextSibling.innerText)
    End If
    If Not oPost Is Nothing And oPost.Length >= 2 Then
        Debug.Print reverseString(oPost(1).NextSibling.innerText)
    End If
End Sub
Output:
13025438495 eshijiali#outlook.com
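VBA also has a built-in StrReverse function that does the same job as a hand-written reversing loop. A minimal sketch, using a made-up sample string rather than live page data (the Sub name is hypothetical):

```vba
Option Explicit

Public Sub DemoStrReverse()
    Dim reversedText As String
    reversedText = "moc.elpmaxe@ofni"       ' hypothetical reversed value, not taken from the page
    Debug.Print StrReverse(reversedText)    ' prints info@example.com
End Sub
```

In a scraper like the one above, each call to a custom reversing function could be swapped for StrReverse without other changes.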

Why does my code to scrape text using VBA work in debug mode only

I have written some code to scrape specific dates from Google's patent website. After reviewing lots of examples I figured out the getElementsByClassName that gets the date I need. The code below works when I step through in debug mode and generates the desired MsgBox. But when I run it, it gives me "Run-time error '91': Object variable or With block variable not set."
I have added delays wherever I thought that might be an issue. I have also disassociated the code from any interaction with the Excel spreadsheet where I would ultimately put the date, just to make it as simple as possible. I've also copied the code from the original spreadsheet to a new blank one, but same issue.
Any help would be appreciated.
Sub Get_Date()
    Dim ie As InternetExplorer
    Dim sURL As String
    Dim strGrant As Variant
    Set ie = New InternetExplorer
    sURL = "https://patents.google.com/patent/US6816842B1/en?oq=6816842"
    ie.navigate sURL
    ie.Visible = False
    Do While ie.Busy Or ie.ReadyState < 4
        DoEvents
    Loop
    strGrant = ie.document.getElementsByClassName("granted style-scope application-timeline")(0).innerText
    Do While ie.Busy Or ie.ReadyState < 4
        DoEvents
    Loop
    MsgBox strGrant
    ie.Quit
End Sub
It's likely a timing issue as per my comment. That's dealt with in other answers to similar questions. Main things to consider are:
Use proper page load waits: While IE.Busy Or ie.readyState < 4: DoEvents: Wend
Possibly a timed loop to attempt to set the element to a variable then testing if set.
Alternatively, and this is a bit of a punt, it seems that all granted dates are the same as the publication dates (patent publication date). If this is true, you can use XHR to get the publication date:
Option Explicit
Public Sub GetDates()
    Dim html As HTMLDocument, i As Long, patents()
    patents = Array("US7724240", "US6876312", "US8259073", "US7523862", "US6816842B1")
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        For i = LBound(patents) To UBound(patents)
            .Open "GET", "https://patents.google.com/patent/" & patents(i) & "/en?oq=" & patents(i), False
            .setRequestHeader "User-Agent", "Mozilla/5.0"
            .send
            html.body.innerHTML = .responseText
            If html.querySelectorAll("[itemprop=publicationDate]").length > 0 Then
                Debug.Print html.querySelector("[itemprop=publicationDate]").DateTime
            End If
        Next
    End With
End Sub

Scrape data that is not in the source code, using VBA

I'm trying to scrape whole div from one website. The data is not visible in the source code, it changes based on the variable in the URL (link).
I was looking for any solution to copy to the excel sheet everything from
<div id="div_measures_for_2103909010" class="measures_detail">
Unfortunately, since the data is not in the page source itself, I have found a way to display only the data from the div provided above: Link
However, to get this data I would first need to get the link to the direct data (the link is in the source code).
Do you have any idea how best to deal with it?
I've tried to download the source code, search for the link, open the link and copy all the data, but I have trouble downloading the source code (Excel stores only part of it, because a single cell holds at most 32,767 characters). Here is my current code:
Sub Open_Webpage()
Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
URL = "https://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&SimDate=20190329&Area=&MeasType=&StartPub=&EndPub=&MeasText=&GoodsText=&op=&Taric=2103909010&search_text=goods&textSearch=&LangDescr=pl&OrderNum=&Regulation=&measStartDat=&measEndDat="
objHTTP.Open "GET", URL, False
objHTTP.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
objHTTP.send ("")
html = objHTTP.responseText
Range("A1").Value = html
End Sub
If I am able to have full code in one cell I can then look for the link in the source code and use it:
=MID(LEFT(A1,FIND("' width='100%'",A1)-1),FIND("' src='",A1)+7,LEN(A1))
I know that there must be some better solution, but I'm not so proficient in VBA to figure it out...
You can regex out the required URL, do a little cleaning, then pass it on to XHR. For some reason I was unable to simply use getAttribute("onclick"), so I had to use outerHTML (innerHTML is also fine) on the element.
Option Explicit
Public Sub GetInfo()
    Dim html As HTMLDocument, s As String, re As Object, url As String
    Set re = CreateObject("vbscript.regexp")
    Set html = New HTMLDocument '< VBE > Tools > References > Microsoft HTML Object Library
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&SimDate=20190329&Area=&MeasType=&StartPub=&EndPub=&MeasText=&GoodsText=&op=&Taric=2103909010&search_text=goods&textSearch=&LangDescr=pl&OrderNum=&Regulation=&measStartDat=&measEndDat=", False
        .send
        html.body.innerHTML = .responseText
        s = html.querySelector("[id$='_end_goods']").outerHTML
        With re
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
            .Pattern = "measures_details\.jsp(.*)'\);"
            If .Test(s) Then
                url = "https://ec.europa.eu/taxation_customs/dds2/taric/measures_details.jsp" & .Execute(s)(0).SubMatches(0)
                url = Replace$(url, "&amp;", "&") 'decode the HTML-encoded ampersands in the extracted query string
            End If
        End With
        If Len(url) > 0 Then
            .Open "GET", url, False
            .send
            html.body.innerHTML = .responseText
            ActiveSheet.Cells(1, 1) = html.querySelector(".measures_detail").innerText
        End If
    End With
End Sub
Try the regex here
References:
VBE > Tools > References > Microsoft HTML Object Library
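To see what the pattern captures without hitting the site, here is a minimal offline sketch. The helper name ExtractDetailsQuery and the sample fragment are made up for illustration; the fragment is only assumed to resemble the real onclick markup. The pattern is the answer's, except that it uses a lazy quantifier so the match stops at the first closing ');.

```vba
Option Explicit

' Hypothetical helper: pulls the query string that follows
' "measures_details.jsp" out of a chunk of HTML and decodes "&amp;".
Public Function ExtractDetailsQuery(ByVal source As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    re.Pattern = "measures_details\.jsp(.*?)'\);"   ' lazy: stop at the first closing quote, bracket and semicolon
    If re.Test(source) Then
        ExtractDetailsQuery = Replace$(re.Execute(source)(0).SubMatches(0), "&amp;", "&")
    End If
End Function

Public Sub DemoExtractDetailsQuery()
    Dim sample As String
    ' Made-up fragment shaped like an onclick handler; not copied from the site
    sample = "openDetails('measures_details.jsp?Lang=en&amp;Taric=2103909010');"
    Debug.Print ExtractDetailsQuery(sample)   ' prints ?Lang=en&Taric=2103909010
End Sub
```

The function returns an empty string when nothing matches, which mirrors the Len(url) > 0 check in the answer.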
