Problems While Scraping with VBA - VBA Web Scraping - excel

I'm trying to get a number from a static page on a website, but when I make the HTTP request, the response is strange HTML that lacks the information from the original page that I want.
The website that I'm trying to get the information from is:
https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=233361&cvm=true
but I can get the same result with:
https://fnet.bmfbovespa.com.br/fnet/publico/visualizarDocumento?id=233361&cvm=true
The number that I want to get is "0,05" on the page.
My code is:
Sub trying()
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=233361&cvm=true&", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
.send
html.body.innerHTML = .responseText
End With
Set element = html.getElementsByTagName("td")(31).innerText
Sheets("Sheet1").Cells(1, 1) = element
End Sub
I have also tried to do that using InternetExplorer.Application, but the problem remains the same.

After trying a number of request headers, I found that an Accept request header is required for the response to come back as HTML:
Sub trying()
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=233361&cvm=true&", False
.setRequestHeader "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
.send
html.body.innerHTML = .responseText
End With
Dim element As String
element = html.getElementsByTagName("td")(32).innerText
Sheets("Sheet1").Cells(1, 1) = element
End Sub

Just add a .htm (or .html) extension to the request to specify the file type wanted.
Option Explicit
Public Sub trying()
Dim html As MSHTML.HTMLDocument
Set html = New MSHTML.HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento.htm?id=233361", False
.send
html.body.innerHTML = .responseText
End With
Debug.Print html.querySelector("tr:nth-child(6) .dado-valores").innerText
End Sub
Alternatives that take a little more effort include adding the Accept header or base64 decoding the .responseText before writing to .innerHTML.
If you went down the less preferable base64 decode route, using the function below (note: pretty sure this is not the original source), then you will need to amend the following line:
.DataType = "bin.base64": .Text = Replace$(b64, Chr$(34), vbNullString) 'modified line
Public Function DecodeBase64(b64$)
Dim b
With CreateObject("Microsoft.XMLDOM").createElement("b64")
.DataType = "bin.base64": .Text = Replace$(b64, Chr$(34), vbNullString) 'modified line
b = .nodeTypedValue
With CreateObject("ADODB.Stream")
.Open: .Type = 1: .Write b: .Position = 0: .Type = 2: .Charset = "utf-8"
DecodeBase64 = .ReadText
.Close
End With
End With
End Function

Related

Concatenate referenced URL into XML HTTP Request

The following snippet of code sends an XML HTTP request to the following site:
Sub GetContents()
Dim XMLReq As New MSXML2.XMLHTTP60
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.028.723", False
XMLReq.send
End Sub
I have another Sub routine GetURL() which prints out the desired URL in this case: https://echa.europa.eu/brief-profile/-/briefprofile/100.028.723
How can I essentially concatenate the output of GetURL() into the BstrUrl? i.e.
XMLReq.Open "Get", "x", False where x is the output of GetURL()
Despite various attempts the syntax is not accepted as a URL.
Assuming you are combining from your earlier question then you need to ensure you write a function which returns the url (as Tim Williams has pointed out). I would expand upon this, in that I think you would need to consider adding a test to ensure both the request succeeded, there were results, and to pass the searchKeyWord as an argument to make your function more reusable. Along the same lines, you could pass the xmlhttp object into the function, so as to avoid continually creating and destroying them.
Avoid auto-instantiation, as you can get unexpected results, and Hungarian style notation. Personally, I also avoid those type characters, as they are harder to read.
vbNullString will offer faster assignment than = "".
I would also use a shorter, faster, and more reliable css pattern to retrieve the url, based on classes and a parent child relationship of two elements.
Public Sub GetContents()
Dim searchKeyWord As String, xmlReq As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument, url As String
searchKeyWord = "Acetone"
Set xmlReq = New MSXML2.XMLHTTP60
url = GetUrl(searchKeyWord, xmlReq)
Set html = New MSHTML.HTMLDocument
If url <> "N/A" Then
With xmlReq
.Open "GET", url, False
.send
If .Status = 200 Then
html.body.innerHTML = .responseText
Debug.Print html.querySelector("title").innerText
End If
End With
End If
End Sub
Public Function GetUrl(ByVal searchKeyWord As String, ByVal http As MSXML2.XMLHTTP60) As String
Const url = "https://echa.europa.eu/search-for-chemicals?p_auth=5ayUnMyz&p_p_id=disssimplesearch_WAR_disssearchportlet&p_p_lifecycle=1&p_p_state=normal&p_p_col_id=_118_INSTANCE_UFgbrDo05Elj__column-1&p_p_col_count=1&_disssimplesearch_WAR_disssearchportlet_javax.portlet.action=doSearchAction&_disssimplesearch_WAR_disssearchportlet_backURL=https%3A%2F%2Fecha.europa.eu%2Finformation-on-chemicals%3Fp_p_id%3Ddisssimplesearchhomepage_WAR_disssearchportlet%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3D_118_INSTANCE_UFgbrDo05Elj__column-1%26p_p_col_count%3D1%26_disssimplesearchhomepage_WAR_disssearchportlet_sessionCriteriaId%3D"
Dim html As MSHTML.HTMLDocument, dict As Object, i As Long, r As Long
Dim dictKey As Variant, payload$, ws As Worksheet
Set html = New MSHTML.HTMLDocument
Set dict = CreateObject("Scripting.Dictionary")
Set ws = ThisWorkbook.Worksheets("Sheet1")
dict("_disssimplesearchhomepage_WAR_disssearchportlet_formDate") = "1621017052777" 'timestamp
dict("_disssimplesearch_WAR_disssearchportlet_searchOccurred") = "true"
dict("_disssimplesearch_WAR_disssearchportlet_sskeywordKey") = searchKeyWord
dict("_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer") = "true"
dict("_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox") = "on"
payload = vbNullString
For Each dictKey In dict
'Test the payload, not the key: the first pair must not be prefixed with "&"
payload = IIf(Len(payload) = 0, WorksheetFunction.EncodeURL(dictKey) & "=" & WorksheetFunction.EncodeURL(dict(dictKey)), _
payload & "&" & WorksheetFunction.EncodeURL(dictKey) & "=" & WorksheetFunction.EncodeURL(dict(dictKey)))
Next dictKey
With http
.Open "POST", url, False
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
.send (payload)
If .Status = 200 Then
html.body.innerHTML = .responseText
Else
GetUrl = "N/A"
Exit Function
End If
End With
Dim result As Boolean
result = html.querySelectorAll(".lfr-search-container .substanceNameLink").Length > 0
GetUrl = IIf(result, html.querySelector(".lfr-search-container .substanceNameLink").href, "N/A")
End Function
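The payload loop in GetUrl can also be pulled out into a small helper. The following is only a sketch under assumptions: BuildFormPayload and the field names in the demo are hypothetical and are not part of the original answer, and WorksheetFunction.EncodeURL requires Excel 2013 or later.

```vba
Option Explicit

'Hypothetical helper: joins dictionary entries into "k1=v1&k2=v2",
'URL-encoding both the key and the value of each pair.
Public Function BuildFormPayload(ByVal fields As Object) As String
    Dim fieldKey As Variant, pair As String, payload As String
    payload = vbNullString
    For Each fieldKey In fields.Keys
        pair = WorksheetFunction.EncodeURL(fieldKey) & "=" & WorksheetFunction.EncodeURL(fields(fieldKey))
        If Len(payload) = 0 Then
            payload = pair                    'first pair: no leading "&"
        Else
            payload = payload & "&" & pair    'later pairs: separated by "&"
        End If
    Next fieldKey
    BuildFormPayload = payload
End Function

Public Sub BuildFormPayloadDemo()
    Dim fields As Object
    Set fields = CreateObject("Scripting.Dictionary")
    'Illustrative field names only, not the site's real parameter names
    fields("searchOccurred") = "true"
    fields("keyword") = "Acetone"
    Debug.Print BuildFormPayload(fields)
End Sub
```

Using If/Else rather than IIf avoids evaluating both branches on every pass and makes the "first pair" condition easier to read.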
If GetURL is a function returning a string then this should work:
Sub GetContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim url
url = GetURL()
XMLReq.Open "Get", url, False
XMLReq.send
End Sub

VBA error 438 object doesn't support this property or method while trying to Web Scraping from a site

I have a problem while trying to get elements from a website with VBA. I've searched for this problem on StackOverflow, but none of the answers that I tried have solved it.
I want to get the text inside the Strong element, but this element is nested inside a Div.
Sub StatusInvest()
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://statusinvest.com.br/acoes/eua/aapl", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
.send
html.body.innerHTML = .responseText
End With
Application.ScreenUpdating = False
Set v = html.getElementsByClass("info special w-100 w-md-33 w-lg-20")(0).Value = "00000"
Set test = valor.getElementsByTagName("Strong").innerText
Range("B1").Value = teste
Application.ScreenUpdating = True
End Sub
This is the element that I'm trying to get:
[Image: sample of the HTML]
You got the method name wrong. It should be getElementsByClassName.
getElementsByTagName and getElementsByClassName both return a collection of elements, whether there is one match or more, so you have to refer to index 0 when you are trying to get the strong element.
The innerText property returns a string, so you shouldn't use a Set statement; simply assign the value to test.
Sub StatusInvest()
Dim html As HTMLDocument
Dim valor As Object
Dim test As String
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://statusinvest.com.br/acoes/eua/aapl", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
.send
html.body.innerHTML = .responseText
End With
Application.ScreenUpdating = False
Set valor = html.getElementsByClassName("special")(0)
test = valor.getElementsByTagName("Strong")(0).innerText
Range("B1").Value = test
Application.ScreenUpdating = True
End Sub
In addition to what was mentioned in Raymond's answer, it would be faster, and more robust, to use a single class, via css class selector, to target the desired element. The target element itself has a single class you can use. By using querySelector you stop matching after first match retrieved, rather than returning an entire collection and then indexing.
Use the commented out line if you want to include the currency symbol in front of the value.
Option Explicit
Public Sub StatusInvest()
Dim html As MSHTML.HTMLDocument
Set html = New MSHTML.HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://statusinvest.com.br/acoes/eua/aapl", False
.SetRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
.send
html.body.innerHTML = .responseText
End With
ActiveSheet.Range("B1").Value = html.querySelector(".value").innerText
'ActiveSheet.Range("B1").Value = html.querySelector(".icon").innerText & Chr$(32) & html.querySelector(".value").innerText
End Sub
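Since querySelector returns Nothing when no element matches, reading .innerText directly raises a runtime error if the page layout changes. A minimal sketch of a guard, assuming a hypothetical helper named TextOrDefault (not part of the answers above) and a made-up HTML fragment:

```vba
Option Explicit
'Requires: Tools > References > Microsoft HTML Object Library

'Hypothetical helper: returns the innerText of the first element matching
'the css selector, or a fallback string when nothing matches.
Public Function TextOrDefault(ByVal html As MSHTML.HTMLDocument, ByVal selector As String, _
                              Optional ByVal fallback As String = "N/A") As String
    Dim node As Object
    Set node = html.querySelector(selector)
    If node Is Nothing Then
        TextOrDefault = fallback
    Else
        TextOrDefault = node.innerText
    End If
End Function

Public Sub TextOrDefaultDemo()
    Dim html As MSHTML.HTMLDocument
    Set html = New MSHTML.HTMLDocument
    'Made-up fragment standing in for the downloaded page
    html.body.innerHTML = "<div><strong class=""value"">123,45</strong></div>"
    Debug.Print TextOrDefault(html, ".value")     'element exists
    Debug.Print TextOrDefault(html, ".missing")   'no match, so the fallback is returned
End Sub
```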

How to pull image from Amazon UK

I am trying to download the hi-res pictures from Amazon.co.uk. I tried the code given in another thread and I am running into some issues.
The code by @QHarr works well for the given Amazon.in website, but when I try it on Amazon.co.uk,
.querySelector("#landingImage").getAttribute("data-old-hires") returns nothing. Here is the code I am testing.
Public Sub GetInfo()
Dim Html As HTMLDocument, results()
Set Html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.amazon.co.uk/dp/product/B01GFJWHZ0", True
.send
Html.body.innerHTML = .responseText
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 2) = Html.querySelector("#productTitle").innerText
.Cells(1, 2).Offset(0, 1) = Html.querySelector("#landingImage").getAttribute("data-old-hires")
End With
End With
End Sub
Any idea what exactly I am doing wrong?
My guess is the response html is mangled. You can easily regex out that bit from the response string however.
Note that the async argument is also set to False to allow time for data to load. I'm somewhat surprised your code got as far as you said it did. It should also have tripped over on the fact there is no match for that selector in the HTMLDocument.
Public Sub GetInfo()
'tools > references > Microsoft HTML Object Library
Dim re As Object, html As MSHTML.HTMLDocument, xhr As Object, s As String
Set re = CreateObject("VBScript.RegExp")
Set xhr = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
re.Pattern = """hiRes"":""(.*?)"""
With xhr
.Open "GET", "https://www.amazon.co.uk/dp/product/B01GFJWHZ0", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
html.body.innerHTML = s
End With
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 2) = html.querySelector("#productTitle").innerText
.Cells(1, 3) = re.Execute(s)(0).SubMatches(0)
End With
End Sub
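Regex: the pattern "hiRes":"(.*?)" matches the literal key "hiRes" and lazily captures everything up to the next double quote, so the image URL ends up in the first submatch.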
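To see what the hiRes pattern captures without making a request, it can be run against a hard-coded string. This is a sketch only: the sample string below is made up and merely imitates the shape of the JSON embedded in the page source.

```vba
Option Explicit

Public Sub HiResRegexDemo()
    Dim re As Object, matches As Object, sample As String
    'Made-up fragment standing in for the page source; the real response is far larger
    sample = "{""hiRes"":""https://example.com/images/sample.jpg"",""thumb"":""https://example.com/t.jpg""}"
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = """hiRes"":""(.*?)"""        'lazy capture up to the next double quote
    Set matches = re.Execute(sample)
    If matches.Count > 0 Then
        Debug.Print matches(0).SubMatches(0)  'first capture group of the first match
    Else
        Debug.Print "No match"
    End If
End Sub
```

Checking matches.Count before indexing avoids a runtime error when the key is absent from the response.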

Print title importing from one location to another

I've created a VBA script to parse the titles of different posts, along with the editing status of those posts, from a website. What I wish to do now is let my script parse the title from its landing page but print the title at the same time as it prints the editing status. I do not wish to create two subs for this task. I do not even know if it is possible in VBA. If anything is unclear, please check out the comments within my script.
Sub ImportTitleFromAnotherLocation()
Const LINK$ = "https://stackoverflow.com/questions/tagged/web-scraping"
Const prefix$ = "https://stackoverflow.com"
Dim Http As New XMLHTTP60, Html As New HTMLDocument
Dim editInfo As Object, I&, targetUrl$, postTile$
With Http
.Open "GET", LINK, False
.send
Html.body.innerHTML = .responseText
End With
With Html.querySelectorAll(".summary .question-hyperlink")
For I = 0 To .Length - 1
postTitle = .item(I).innerText 'I like this line to be transferred to the location below
targetUrl = Replace(.item(I).getAttribute("href"), "about:", prefix)
With Http
.Open "GET", targetUrl, False
.send
Html.body.innerHTML = .responseText
End With
R = R + 1: Cells(R, 1) = postTitle 'here I wish to use the above line like this
Set editInfo = Html.querySelector(".user-action-time > a")
If Not editInfo Is Nothing Then
Cells(R, 2) = editInfo.innerText
End If
Next I
End With
End Sub
You are overwriting your html document in the loop. A simple way would be to use a second htmldocument variable. A more verbose way would be to store the titles before the loop, for example in an array during an additional loop, then use your i variable to index into that to retrieve each title during the existing loop.
Sub ImportTitleFromAnotherLocation()
Const LINK$ = "https://stackoverflow.com/questions/tagged/web-scraping"
Const prefix$ = "https://stackoverflow.com"
Dim Http As New XMLHTTP60, Html As New HTMLDocument, Html2 As New HTMLDocument
Dim editInfo As Object, I&, targetUrl$
Dim postTitle As String, r As Long
With Http
.Open "GET", LINK, False
.send
Html.body.innerHTML = .responseText
End With
With Html.querySelectorAll(".summary .question-hyperlink")
For I = 0 To .Length - 1
postTitle = .item(I).innerText 'I like this line to be transferred to the location below
targetUrl = Replace$(.item(I).getAttribute("href"), "about:", prefix)
With Http
.Open "GET", targetUrl, False
.send
Html2.body.innerHTML = .responseText
End With
r = r + 1: ActiveSheet.Cells(r, 1) = postTitle 'here I wish to use the above line like this
Set editInfo = Html2.querySelector(".user-action-time > a")
If Not editInfo Is Nothing Then
ActiveSheet.Cells(r, 2) = editInfo.innerText
End If
Next I
End With
End Sub
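The more verbose alternative, storing the titles before overwriting the document, could be sketched as follows. This is an illustration rather than the answerer's code: the Sub name and the arrays are assumptions, and the css selectors are the ones from the question, which may no longer match the live site.

```vba
Option Explicit
'Requires references: Microsoft XML, v6.0 and Microsoft HTML Object Library

Public Sub ImportTitlesViaArrays()
    Const LINK As String = "https://stackoverflow.com/questions/tagged/web-scraping"
    Const PREFIX As String = "https://stackoverflow.com"
    Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
    Dim links As Object, editInfo As Object
    Dim titles() As String, urls() As String, i As Long

    Set http = New MSXML2.XMLHTTP60
    Set html = New MSHTML.HTMLDocument

    http.Open "GET", LINK, False
    http.send
    html.body.innerHTML = http.responseText

    Set links = html.querySelectorAll(".summary .question-hyperlink")
    If links.Length = 0 Then Exit Sub

    'First loop: copy everything needed out of the listing page
    ReDim titles(0 To links.Length - 1)
    ReDim urls(0 To links.Length - 1)
    For i = 0 To links.Length - 1
        titles(i) = links.Item(i).innerText
        urls(i) = Replace$(links.Item(i).getAttribute("href"), "about:", PREFIX)
    Next i

    'Second loop: the single document can now be overwritten safely
    For i = LBound(urls) To UBound(urls)
        http.Open "GET", urls(i), False
        http.send
        html.body.innerHTML = http.responseText
        ActiveSheet.Cells(i + 1, 1).Value = titles(i)
        Set editInfo = html.querySelector(".user-action-time > a")
        If Not editInfo Is Nothing Then ActiveSheet.Cells(i + 1, 2).Value = editInfo.innerText
    Next i
End Sub
```

The trade-off is an extra loop and two arrays in exchange for needing only one HTMLDocument.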

getElementsBy() extract text

I'm really new to VBA and I've been trying to get the value below the Column "Impuesto".
I'm getting error 438. I still don't quite understand how to refer to a certain part of the web page.
Sub extract()
Dim myIE As Object
Dim myIEDoc As Object
Dim element As IHTMLElement
Set myIE = CreateObject("InternetExplorer.Application")
myIE.Visible = False
myIE.navigate "https://zonasegura1.bn.com.pe/TipoCambio/"
While myIE.Busy
DoEvents
Wend
Set myIEDoc = myIE.document
Range("B1") = myIEDoc.getElementsByID("movimiento")(0).getElementsByTagName("span")
End Sub
You need getElementsByClassName(), not getElementsByID, since movimiento is a class name: <li class="movimiento bg"> Impuesto </li>
Range("B1") = myIEDoc.getElementsByClassName("movimiento")(0).getElementsByClassName("l2 valor")(0).innerText
Edit:
If you are matching on the tag name, e.g. <li>..</li>, use getElementsByTagName("li").
If the tag has an id, e.g. <li id="movimiento">..</li>, use getElementById("movimiento").
If the tag has a class, e.g. <li class="movimiento">..</li>, use getElementsByClassName("movimiento").
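The three lookups can be tried side by side on an in-memory document. This is a sketch under assumptions: the HTML fragment, the id "first" and the value inside the span are made up for illustration and only loosely imitate the real page.

```vba
Option Explicit
'Requires: Tools > References > Microsoft HTML Object Library

Public Sub LookupMethodsDemo()
    Dim html As MSHTML.HTMLDocument
    Set html = New MSHTML.HTMLDocument
    'Made-up fragment; the real page has more elements and different values
    html.body.innerHTML = "<ul>" & _
        "<li id=""first"" class=""movimiento bg"">Impuesto</li>" & _
        "<li class=""movimiento""><span class=""l2 valor"">0.005</span></li>" & _
        "</ul>"

    'By tag name: returns a collection, so count it or index into it
    Debug.Print html.getElementsByTagName("li").Length
    'By id: returns a single element, so no index is needed
    Debug.Print html.getElementById("first").innerText
    'By class name: returns a collection, so take item 0 for the first match
    Debug.Print html.getElementsByClassName("movimiento")(0).innerText
    'By css selector: returns the first matching element directly
    Debug.Print html.querySelector(".movimiento .l2.valor").innerText
End Sub
```

Note that only getElementById is singular; the other two getElements* methods always return collections, which is why calling a property such as innerText on them without an index fails with error 438.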
Try the below script. It should fetch you the data you are after. When the execution is done, you should find the value in Range("A1") in your spreadsheet.
Sub Get_Quote()
Dim post As Object
With CreateObject("InternetExplorer.Application")
.Visible = True
.navigate "https://zonasegura1.bn.com.pe/TipoCambio/"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set post = .document.querySelector(".movimiento span.l2.valor")
[A1] = post.innerText
.Quit
End With
End Sub
It is faster to use XMLHTTP request as follows:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://zonasegura1.bn.com.pe/TipoCambio/", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Set html = New HTMLDocument
With html
.body.innerHTML = sResponse
Debug.Print .querySelector(".movimiento .l2.valor").innerText
End With
End Sub

Resources