Scrape data that is not in the source code, using VBA

I'm trying to scrape an entire div from a website. The data is not in the page source; it is loaded separately and changes based on a variable in the URL.
I am looking for a way to copy into an Excel sheet everything inside
<div id="div_measures_for_2103909010" class="measures_detail">
Since the data is not in the page source itself, I found a separate link that displays only the contents of the div above.
However, to get this data I first need to obtain that direct link, which does appear in the page source.
Do you have any idea how best to deal with this?
I've tried to download the source code, search it for the link, open the link and copy all the data, but I have trouble downloading the source code (Excel stores only part of it because of the per-cell character limit). Here is my current code:
Sub Open_Webpage()
Set objHTTP = CreateObject("MSXML2.ServerXMLHTTP")
URL = "https://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&SimDate=20190329&Area=&MeasType=&StartPub=&EndPub=&MeasText=&GoodsText=&op=&Taric=2103909010&search_text=goods&textSearch=&LangDescr=pl&OrderNum=&Regulation=&measStartDat=&measEndDat="
objHTTP.Open "GET", URL, False
objHTTP.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
objHTTP.send ("")
html = objHTTP.responseText
Range("A1").Value = html
End Sub
If I could get the full source into one cell, I could then look for the link in it with a formula and use it:
=MID(LEFT(A1,FIND("' width='100%'",A1)-1),FIND("' src='",A1)+7,LEN(A1))
I know there must be a better solution, but I'm not proficient enough in VBA to figure it out.

You can use a regex to extract the required URL, clean it up a little, then pass it to a second XHR request. For some reason I was unable to simply use getAttribute("onclick"), so I had to use outerHTML (innerHTML also works) on the element.
Option Explicit
Public Sub GetInfo()
Dim html As HTMLDocument, s As String, re As Object, url As String
Set re = CreateObject("vbscript.regexp")
Set html = New HTMLDocument '< VBE > Tools > References > Microsoft HTML Object Library
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&SimDate=20190329&Area=&MeasType=&StartPub=&EndPub=&MeasText=&GoodsText=&op=&Taric=2103909010&search_text=goods&textSearch=&LangDescr=pl&OrderNum=&Regulation=&measStartDat=&measEndDat=", False
.send
html.body.innerHTML = .responseText
s = html.querySelector("[id$='_end_goods']").outerHTML
With re
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = "measures_details\.jsp(.*)'\);"
If .Test(s) Then
url = "https://ec.europa.eu/taxation_customs/dds2/taric/measures_details.jsp" & .Execute(s)(0).SubMatches(0)
url = Replace$(url, "&amp;", "&") 'outerHTML keeps ampersands entity-encoded, so decode them
End If
End With
If Len(url) > 0 Then
.Open "GET", url, False
.send
html.body.innerHTML = .responseText
ActiveSheet.Cells(1, 1) = html.querySelector(".measures_detail").innerText
End If
End With
End Sub
References:
VBE > Tools > References > Microsoft HTML Object Library
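The URL-extraction step can be tried offline before touching the network. The following is a minimal sketch, not the answer's exact code: ExtractDetailsUrl is a hypothetical helper name, the sample markup is made up for illustration, and the pattern uses a non-greedy capture (a small change from the greedy one above) so that it stops at the first closing quote.

```vba
Option Explicit

' Hypothetical helper: returns the measures_details.jsp URL found in an
' element's outerHTML, or "" when the pattern does not match.
Public Function ExtractDetailsUrl(ByVal outerHtml As String) As String
    Const BASE_URL As String = "https://ec.europa.eu/taxation_customs/dds2/taric/measures_details.jsp"
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    ' Non-greedy capture of the query string up to the end of the JS call
    re.Pattern = "measures_details\.jsp(.*?)'\);"

    If re.Test(outerHtml) Then
        Set matches = re.Execute(outerHtml)
        ' Decode entity-encoded ampersands before using the URL
        ExtractDetailsUrl = Replace$(BASE_URL & matches(0).SubMatches(0), "&amp;", "&")
    End If
End Function

Public Sub DemoExtractDetailsUrl()
    ' Made-up markup for illustration only; the real page may differ
    Dim sample As String
    sample = "<a onclick=""loadDetails('measures_details.jsp?Lang=en&amp;Taric=2103909010');"">+</a>"
    Debug.Print ExtractDetailsUrl(sample)
End Sub
```

Returning an empty string on no match lets the caller keep the same Len(url) > 0 guard used in the answer.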

Related

Data values of uploaded images

There are a large number of elements with the class "df-table" on the page at that link. I was able to write the first one to the Excel sheet, but not the second, the third, and so on.
Various internet resources say this can be solved with nth-of-type or XPath, but the error keeps occurring.
I want the data values shown in the uploaded images.
Public Sub Jaemu()
Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
Dim d As WebDriver, ws As Worksheet, URL As String
Set d = New ChromeDriver
Set ws = ThisWorkbook.Worksheets("gemstone2")
Dim http As New WinHttpRequest
With d
'.AddArgument "--headless"
.Start "Chrome"
Dim html As HTMLDocument
Dim JsonObject As Object
Set html = New HTMLDocument
URL = "https://globalmonitor.einfomax.co.kr/infomax_ds.html#/USA/1/1"
.get URL, Raise:=False ' set raise to false to avoid a timeout error
d.FindElementByCss("[ng-click='openStockSearchPopup();']").Click
d.FindElementByCss("[ng-enter='searchStockSearchPopup(true);']").SendKeys "GOOGL"
d.FindElementByCss("[ng-click='searchStockSearchPopup(true);']").Click
d.FindElementByCss("[class='slick-cell l1 r1 text-center clickable']").Click
Cells(2, 1).Value = d.FindElementByCss("[class='df-table']").Text
Cells(3, 1).Value = d.FindElementByCss(".table-contents[ng-if='IS_RT_STATE_SUCCESS(requeststate.prospectData)'] > .df-table").Text
End With
End Sub
Original OP error:
runtime error 32.
Coding line.
Cells(3, 1).Value = d.FindElementByCss("[class='df-table' class:nth-of-type(2)]").Text
New error following initial suggestion to use different CSS selector:
Runtime error 7
Coding line.
Cells(3, 1).Value = d.FindElementByCss(".table-contents[ng-if='IS_RT_STATE_SUCCESS(requeststate.prospectData)'] > .df-table").Text

Initial error (RTE32):
The :nth-of-type() pseudo class selector would go outside of the attribute selector closing ] i.e. "[class='df-table']:nth-of-type(2)", however this does not provide a match.
You can use:
.table-contents[ng-if='IS_RT_STATE_SUCCESS(requeststate.prospectData)'] > .df-table
This returns a single match. It is specific enough, without relying on a long selector list, that in my experience it is likely to be fairly stable.
If you had wanted to use :nth-of-type() you could have used it on a parent element and taken first match, but that is less stable and prone to breaking if html changes:
.contents-area:nth-of-type(5) .df-table
Follow-up error (RTE7):
The element needs to be scrolled into view.
A less-than-ideal way (your mileage may vary) is:
d.ExecuteScript "window.scrollTo(0, document.body.scrollHeight/3);"
Application.Wait Now + TimeSerial(0, 0, 2)
Activesheet.Cells(1,3) = d.FindElementByCss(".table-contents[ng-if='IS_RT_STATE_SUCCESS(requeststate.prospectData)'] > .df-table").Text
There are better ways with inbuilt functions and custom JS which you can likely pull from the examples.xlsm by the author on GitHub. As I can only test with Python this was a quick and dirty test that worked.
You could avoid the overhead of a browser and use XHR to get a JSON response, which you can then pass to a JSON parser, as follows:
Option Explicit
Public Sub GetData()
Dim s As String
With CreateObject("MSXML2.XMLHTTP")
.Open "POST", "https://globalmonitor.einfomax.co.kr/facset/getKeyData", False
.SetRequestHeader "User-Agent", "Mozilla/5.0"
.SetRequestHeader "content-type", "application/json;charset=UTF-8"
.send "{""param"":""NAS:AAPL""}"
s = .responsetext
End With
Debug.Print s
End Sub
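The doubled quotes in the request body are easy to mistype when the ticker needs to vary. A small sketch of a helper that builds the body is shown here; BuildKeyDataBody is a hypothetical name, and it assumes the symbol contains no quotes or backslashes that would need JSON escaping. Whether other symbols follow the same "NAS:" prefix as the example is also an assumption.

```vba
Option Explicit

' Hypothetical helper: wraps a symbol such as "NAS:AAPL" in the JSON body
' used by the request above. No JSON escaping is performed.
Public Function BuildKeyDataBody(ByVal symbol As String) As String
    BuildKeyDataBody = "{""param"":""" & symbol & """}"
End Function

Public Sub DemoBuildKeyDataBody()
    Debug.Print BuildKeyDataBody("NAS:AAPL")
End Sub
```

The result can be passed straight to .send in place of the hard-coded string.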

Pulling text from website into Excel by Using VBA

I am slowly exploring whether I can use VBA to code a macro that searches a website for each keyword/code in a list in column A and extracts the data. Currently, the code below searches the desired website using only the value in Range("A1"), but it does get to the right page with the data I wish to extract. In this case the code in A1 is 100-52-7.
Sub BrowseToSite()
Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As MSHTML.HTMLDocument
IE.Visible = True
IE.Navigate "https://apps.who.int/food-additives-contaminants-jecfa-database/Search.aspx"
Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop
IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$txtSearch").Value = Range("a1").Value
IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click
Set HTMLDoc = IE.Document
'Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText
End Sub
Now I wish to pull the "0-5 mg/kg bw (1996)" phrase on this page into Excel. I planned to do this by retrieving the inner text of the element with that class name; however, I run into the error "Object variable or With block variable not set" on the following line:
Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText

You can get rid of IE altogether and use XMLHTTP requests to make the script more robust. The following script first sends a GET request to scrape the values of certain parameters that the POST request needs, then issues a POST request and parses the desired content.
This is one efficient way to do it:
Option Explicit
Public Sub GetContent()
Const Url = "https://apps.who.int/food-additives-contaminants-jecfa-database/Search.aspx"
Dim oHttp As Object, oHtml As HTMLDocument, MyDict As Object
Dim DictKey As Variant, payload$, searchKeyword$
Set oHtml = New HTMLDocument
Set oHttp = CreateObject("MSXML2.XMLHTTP")
Set MyDict = CreateObject("Scripting.Dictionary")
'send get requests first to parse the value of "__VIEWSTATE", "__VIEWSTATEGENERATOR" e.t.c., as in oHtml.getElementById("__VIEWSTATE").Value
With oHttp
.Open "GET", Url, False
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
.send
oHtml.body.innerHTML = .responseText
End With
searchKeyword = "100-52-7" 'this is the search keyword you wanna use from your predefined search terms
'MyDict stores keys and values within dictionary, as in __VIEWSTATE = "some value" and so on
MyDict("__VIEWSTATE") = oHtml.getElementById("__VIEWSTATE").Value
MyDict("__VIEWSTATEGENERATOR") = oHtml.getElementById("__VIEWSTATEGENERATOR").Value
MyDict("__EVENTVALIDATION") = oHtml.getElementById("__EVENTVALIDATION").Value
MyDict("ctl00$ContentPlaceHolder1$txtSearch") = searchKeyword
MyDict("ctl00$ContentPlaceHolder1$btnSearch") = "Search"
MyDict("ctl00$ContentPlaceHolder1$txtSearchFEMA") = ""
'joining each set of key and value with ampersand to make it a string so that you can use it as a parameter while issuing post requests, which is what payload is doing
payload = ""
For Each DictKey In MyDict
payload = IIf(Len(payload) = 0, WorksheetFunction.EncodeURL(DictKey) & "=" & WorksheetFunction.EncodeURL(MyDict(DictKey)), _
payload & "&" & WorksheetFunction.EncodeURL(DictKey) & "=" & WorksheetFunction.EncodeURL(MyDict(DictKey)))
Next DictKey
With oHttp
.Open "POST", Url, False
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
.send (payload)
oHtml.body.innerHTML = .responseText
End With
MsgBox oHtml.querySelector("#SearchResultItem > a").NextSibling.NodeValue
End Sub
Make sure to add the following libraries to execute the above script:
Microsoft XML, v6.0
Microsoft Scripting Runtime
Microsoft HTML Object Library
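The payload-building loop can also be pulled out into a small function, which makes it easier to reuse for each search term in column A. This is a sketch under stated assumptions: BuildFormPayload is a hypothetical name, it expects a Scripting.Dictionary (late bound here), and WorksheetFunction.EncodeURL requires Excel 2013 or later.

```vba
Option Explicit

' Hypothetical helper: joins the dictionary's entries into a URL-encoded
' form body of the shape key1=value1&key2=value2.
Public Function BuildFormPayload(ByVal fields As Object) As String
    Dim k As Variant, parts() As String, i As Long

    If fields.Count = 0 Then Exit Function   ' nothing to encode, returns ""

    ReDim parts(0 To fields.Count - 1)
    For Each k In fields.Keys
        parts(i) = WorksheetFunction.EncodeURL(CStr(k)) & "=" & _
                   WorksheetFunction.EncodeURL(CStr(fields(k)))
        i = i + 1
    Next k

    ' Join places the separator only between items, so no leading "&"
    BuildFormPayload = Join(parts, "&")
End Function

Public Sub DemoBuildFormPayload()
    Dim d As Object
    Set d = CreateObject("Scripting.Dictionary")
    d("a") = "1"
    d("b") = "2"
    Debug.Print BuildFormPayload(d)
End Sub
```

Collecting the pairs in an array and joining them once avoids special-casing the first item inside the loop.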

You click on an element with this line of code:
IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click
for which IE makes a POST request to retrieve your results. (I observed this in Edge's dev tools, but the concept is the same in IE.)
While this request is in progress, the element in question is not there yet, so you will need to wait for it to load.
Your prior method of
Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop
would probably work, but I find it inconsistent at times, so I would also check the .Busy property.
Try using this after your click:
IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click
'~~WAIT FOR SEARCH RESULTS TO LOAD~~
Do While IE.ReadyState < READYSTATE_COMPLETE Or IE.Busy
Loop
Set HTMLDoc = IE.Document
Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText
If you're still having issues, you can force IE to wait for the element in question to become available by doing this:
On Error Resume Next
'Query the live document on each pass so a stale or unset HTMLDoc is never tested
Do While IE.Document.getElementsByClassName("sectionHead1")(0) Is Nothing
DoEvents
Loop
On Error GoTo 0
Set HTMLDoc = IE.Document
Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText
This is a simple loop that checks for the object and keeps looping until it is no longer Nothing (which essentially means it has loaded).
I would also recommend adding some sort of timeout that raises an error, in case the webpage is having issues, so you don't end up in an infinite loop.
Pro Tip:
If you are clicking the search button many times and waiting for many objects to load, instead of duplicating the above code you can turn it into its own sub and do something like:
Sub WaitForElement(IE As InternetExplorer, className As String)
'Wait for the page itself to finish loading
Do While IE.ReadyState < 4 Or IE.Busy: DoEvents: Loop
'Re-query the document on each pass; an object passed in by the caller is evaluated only once and would never change
Do While IE.Document.getElementsByClassName(className).Length = 0: DoEvents: Loop
End Sub
Then you would just need to use the following line after each click:
WaitForElement IE, "sectionHead1"
Not only would this cut down on the number of lines in your code, it could greatly improve readability as well.
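The timeout recommended above could look roughly like this. It is a sketch only: WaitForClass is a hypothetical name, the 10-second default is an arbitrary assumption, and the IE object is taken as a late-bound Object so no extra reference is needed.

```vba
Option Explicit

' Hypothetical helper: waits until the page has finished loading and at
' least one element with the given class exists. Returns False if
' timeoutSeconds elapses first, so the caller can decide how to fail.
Public Function WaitForClass(ByVal ie As Object, ByVal className As String, _
                             Optional ByVal timeoutSeconds As Double = 10) As Boolean
    Dim deadline As Date
    deadline = Now + timeoutSeconds / 86400#   ' 86400 seconds in a day

    Do While Now < deadline
        DoEvents   ' keep Excel responsive while polling
        If ie.ReadyState = 4 And Not ie.Busy Then   ' 4 = READYSTATE_COMPLETE
            If ie.Document.getElementsByClassName(className).Length > 0 Then
                WaitForClass = True
                Exit Function
            End If
        End If
    Loop
End Function
```

A caller might then write If Not WaitForClass(IE, "sectionHead1", 15) Then Err.Raise vbObjectError + 1, , "Search results did not load" and continue only on success. Because Now has one-second resolution, the timeout is approximate.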

Can MSXML2.XMLHTTP be used with Chrome

I have been using the following Excel VBA macro to bring back data from a website. It worked fine until a few days ago when the website stopped supporting IE. Of course the macro just fails now as there is no data on the webpage to bring back to Excel, just a message saying, "Your browser, Internet Explorer, is no longer supported." Is there a way to have the "Get method" (MSXML2.XMLHTTP) use Chrome instead of IE to interact with the website? BTW, my default browser is already set to "Chrome".
Dim html_doc As HTMLDocument ' note: reference to Microsoft HTML Object Library must be set
Sub KS()
' Define product url
KS_url = "https://www.kingsoopers.com/p/r-w-knudsen-just-blueberry-juice/0007468210784"
' Collect data
Set html_doc = New HTMLDocument
Set xml_obj = CreateObject("MSXML2.XMLHTTP")
xml_obj.Open "GET", KS_url, False
xml_obj.send
html_doc.body.innerHTML = xml_obj.responseText
Set xml_obj = Nothing
KS_product = html_doc.getElementsByClassName("ProductDetails-header")(0).innerText
KS_price = "$" & html_doc.getElementsByClassName("kds-Price kds-Price--alternate mb-8")(1).Value
do Stuff
End Sub

The check for this is a basic server-side check on the user agent. Tell it what it wants to "hear" by passing a supported browser in the UA header (or technically, in this case, just saying the equivalent of "Hi, I am not Internet Explorer").
It can be as simple as xml.setRequestHeader "User-Agent", "Chrome". I say basic because you could even pass xml.setRequestHeader "User-Agent", "I am a unicorn", so it is likely an exclusion-based list on the server for Internet Explorer.
Option Explicit
Public Sub KS()
Dim url As String
url = "https://www.kingsoopers.com/p/r-w-knudsen-just-blueberry-juice/0007468210784"
Dim html As MSHTML.HTMLDocument, xml As Object
Set html = New MSHTML.HTMLDocument
Set xml = CreateObject("MSXML2.XMLHTTP")
xml.Open "GET", url, False
xml.setRequestHeader "User-Agent", "Mozilla/5.0"
xml.send
html.body.innerHTML = xml.responseText
Debug.Print html.getElementsByClassName("ProductDetails-header")(0).innerText
Debug.Print "$" & html.getElementsByClassName("kds-Price kds-Price--alternate mb-8")(1).Value
Stop
End Sub
Compare that with adding no UA or adding xml.setRequestHeader "User-Agent", "MSIE".

Study the article here by Daniel Pineault and this paragraph:
Feature Browser Emulation
Also note my comment dated 2020-09-13.

How to pull image from Amazon UK

I am trying to download the hi-res pictures from Amazon.co.uk. I tried the code given in another thread and I am running into some issues.
The code by @QHarr works well for the Amazon.in website, but when I try it on Amazon.co.uk, .querySelector("#landingImage").getAttribute("data-old-hires") returns nothing. Here is the code I am testing.
Public Sub GetInfo()
Dim Html As HTMLDocument, results()
Set Html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.amazon.co.uk/dp/product/B01GFJWHZ0", True
.send
Html.body.innerHTML = .responseText
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 2) = Html.querySelector("#productTitle").innerText
.Cells(1, 2).Offset(0, 1) = Html.querySelector("#landingImage").getAttribute("data-old-hires")
End With
End With
End Sub
Any idea what exactly I am doing wrong?

My guess is that the response HTML is mangled. However, you can easily regex that bit out of the response string.
Note that I have also set the async argument of .Open to False (your code passes True) so that the request completes before .responseText is read. I'm somewhat surprised your code got as far as you said it did. It should also have tripped over the fact that there is no match for that selector in the HTMLDocument.
Public Sub GetInfo()
'tools > references > Microsoft HTML Object Library
Dim re As Object, html As MSHTML.HTMLDocument, xhr As Object, s As String
Set re = CreateObject("VBScript.RegExp")
Set xhr = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
re.Pattern = """hiRes"":""(.*?)"""
With xhr
.Open "GET", "https://www.amazon.co.uk/dp/product/B01GFJWHZ0", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
html.body.innerHTML = s
End With
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 2) = html.querySelector("#productTitle").innerText
.Cells(1, 3) = re.Execute(s)(0).SubMatches(0)
End With
End Sub
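Note that re.Execute(s)(0) raises an error when the pattern finds nothing. A guarded variant is sketched below; FirstHiResUrl is a hypothetical helper name, and the JSON fragment in the demo is made up for illustration rather than taken from an Amazon response.

```vba
Option Explicit

' Hypothetical helper: returns the first "hiRes" URL embedded in the
' response text, or "" when none is present.
Public Function FirstHiResUrl(ByVal responseText As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = """hiRes"":""(.*?)"""   ' same pattern as in the answer

    Set matches = re.Execute(responseText)
    If matches.Count > 0 Then FirstHiResUrl = matches(0).SubMatches(0)
End Function

Public Sub DemoFirstHiResUrl()
    ' Made-up fragment for illustration only
    Debug.Print FirstHiResUrl("{""hiRes"":""https://example.com/a.jpg"",""thumb"":""x""}")
End Sub
```

Checking matches.Count before indexing means a page without the expected data yields an empty cell instead of a runtime error.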

Excel VBA source code for extracting data from a URL

I want to extract the title of every news item displayed on the "http://pib.nic.in/newsite/erelease.aspx?relid=58313" website using Excel VBA. I have written code using getelementsbyclassname("contentdiv"), but the debugger shows an error saying that the object doesn't support it. I want to extract the information items for every relid, which is also in the URL.

Cold scrapes like this are generally handled more efficiently with a XMLHTTP pull. This requires the addition of a few libraries to the VBE's Tools ► References. The code below needs Microsoft XML, v6.0, Microsoft HTML Object library and Microsoft Internet Controls. Might not need the last one but you probably will if you expand the code beyond what is supplied.
Public Const csURL As String = "http://pib.nic.in/newsite/erelease.aspx?relid=×ID×"
Sub scrape_PIBNIC()
Dim htmlBDY As HTMLDocument, xmlHTTP As MSXML2.ServerXMLHTTP60
Dim i As Long, u As String, iDIV As Long
On Error GoTo CleanUp
Set xmlHTTP = New MSXML2.ServerXMLHTTP60
Set htmlBDY = New HTMLDocument
For i = 58313 To 58313
htmlBDY.body.innerHTML = vbNullString
With xmlHTTP
u = Replace(csURL, "×ID×", i)
'Debug.Print u
.Open "GET", u, False
.setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"
.send
If .Status <> 200 Then GoTo CleanUp
htmlBDY.body.innerHTML = .responseText
For iDIV = 0 To (htmlBDY.getElementsByClassName("contentdiv").Length - 1)
If CBool(htmlBDY.getElementsByClassName("contentdiv")(iDIV).getElementsByTagName("span").Length) Then
Sheets("Sheet1").Cells(Rows.Count, 1).End(xlUp).Offset(1, 0) = _
htmlBDY.getElementsByClassName("contentdiv")(iDIV).getElementsByTagName("span")(0).innerText
End If
Next iDIV
End With
Next i
CleanUp:
Set htmlBDY = Nothing
Set xmlHTTP = Nothing
End Sub
That should be enough to get you started. The site you are targeting requires that charset=UTF-8 be added to the request. I had no success without it. I strongly suspect that this may have been the source of your object doesn't support error.
