How to pull image from Amazon UK - excel

I am trying to download the hi-res pictures from Amazon.co.uk. I tried the code given in the
[thread][1] and I am running into some issues.
The code by @QHarr works well for the given Amazon.in website, but when I try it for Amazon.co.uk,
.querySelector("#landingImage").getAttribute("data-old-hires") returns nothing. Here is the code I am testing:
Public Sub GetInfo()
Dim Html As HTMLDocument, results()
Set Html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.amazon.co.uk/dp/product/B01GFJWHZ0", True
.send
Html.body.innerHTML = .responseText
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 2) = Html.querySelector("#productTitle").innerText
.Cells(1, 2).Offset(0, 1) = Html.querySelector("#landingImage").getAttribute("data-old-hires")
End With
End With
End Sub
Any idea what exactly I am doing wrong?

My guess is that the response HTML is mangled. You can, however, easily regex that bit out of the response string.
Note that the async argument is also set to False to allow time for the data to load. I'm somewhat surprised your code got as far as you said it did; it should also have tripped over the fact that there is no match for that selector in the HTMLDocument.
Public Sub GetInfo()
'tools > references > Microsoft HTML Object Library
Dim re As Object, html As MSHTML.HTMLDocument, xhr As Object, s As String
Set re = CreateObject("VBScript.RegExp")
Set xhr = CreateObject("MSXML2.XMLHTTP")
Set html = New MSHTML.HTMLDocument
re.Pattern = """hiRes"":""(.*?)"""
With xhr
.Open "GET", "https://www.amazon.co.uk/dp/product/B01GFJWHZ0", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
html.body.innerHTML = s
End With
With ThisWorkbook.Worksheets("Sheet1")
.Cells(1, 2) = html.querySelector("#productTitle").innerText
.Cells(1, 3) = re.Execute(s)(0).SubMatches(0)
End With
End Sub
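The pattern used above, "hiRes":"(.*?)", lazily captures everything between the quotes that follow the hiRes key. Here is a minimal sketch of the same extraction with a guard for the case where nothing matches. ExtractHiRes and the sample string are hypothetical, made up for illustration, and are not real Amazon output.

```vba
Option Explicit

' Hypothetical helper: returns the first "hiRes" value found in a response
' string, or vbNullString when the pattern does not match.
Public Function ExtractHiRes(ByVal responseText As String) As String
    Dim re As Object, matches As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = """hiRes"":""(.*?)"""   ' lazy capture up to the next double quote
    Set matches = re.Execute(responseText)
    If matches.Count > 0 Then
        ExtractHiRes = matches(0).SubMatches(0)
    Else
        ExtractHiRes = vbNullString
    End If
End Function

Public Sub DemoExtractHiRes()
    ' Made-up sample string, assumed to resemble the JSON embedded in the page
    Const sample As String = "{""hiRes"":""https://example.com/a.jpg"",""thumb"":""https://example.com/t.jpg""}"
    Debug.Print ExtractHiRes(sample)
    Debug.Print Len(ExtractHiRes("no match here"))
End Sub
```

Checking matches.Count before indexing avoids the run-time error that re.Execute(s)(0) would raise if the page layout changes and the key is missing.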

Related

How can we use http.Open "GET" to list items from a table in HTML?

I'm testing an idea that I had. It seems like I should be able to scrape out various HTML elements from a table in a website, but my code can't seem to find the table, which definitely seems to be there.
Sub TryThis()
Dim oHtml As HTMLDocument
Dim oElement As Object
Set oHtml = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population", False
.send
oHtml.body.innerHTML = .responseText
End With
Set myitem = oHtml.getElementsByClassName("wikitable sortable jquery-tablesorter")
i = 0
For Each oElement In myitem
Sheets("Sheet1").Range("A" & i + 1) = myitem(i).innerText
i = i + 1
Next oElement
End Sub
Essentially, I would like to loop through the HTML items and print out, in cells, what is in the table with the class 'wikitable sortable jquery-tablesorter'.
You were really close. I think the issue is that the jquery-tablesorter class is added by jQuery (or a plugin) via JavaScript after the page has loaded. That class therefore isn't present in the HTML returned by the web request, so removing it from the search criteria should fix the issue.
Here's what I came up with to address this and also move the table contents over a bit quicker. I only handled the first element matching the wikitable sortable classes, but it should be possible to loop over each table too.
Sub TryThis()
Dim oHtml As HTMLDocument
Dim oElement As Object
Dim htmlText As String
Set oHtml = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population", False
.send
oHtml.body.innerHTML = .responseText
End With
htmlText = oHtml.getElementsByClassName("wikitable sortable")(0).outerhtml
With CreateObject("new:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}") 'Clipboard
.SetText htmlText
.PutInClipboard
Sheets(1).Range("A1").Select
Sheets(1).PasteSpecial Format:="Unicode Text"
End With
End Sub
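As a rough sketch of looping over every matching table instead of only the first, the helper below collects one tab-delimited string per row. TableRowsAsText and the sample HTML are hypothetical, and merged cells (colspan/rowspan) are not handled.

```vba
Option Explicit

' Hypothetical helper: returns one tab-delimited string per row for every
' table matching the given class names.
' Requires a reference to Microsoft HTML Object Library.
Public Function TableRowsAsText(ByVal bodyHtml As String, ByVal classNames As String) As Collection
    Dim html As MSHTML.HTMLDocument, tables As Object, tbl As Object
    Dim t As Long, r As Long, c As Long, rowText As String
    Dim result As Collection

    Set result = New Collection
    Set html = New MSHTML.HTMLDocument
    html.body.innerHTML = bodyHtml

    Set tables = html.getElementsByClassName(classNames)
    For t = 0 To tables.Length - 1
        Set tbl = tables.Item(t)
        For r = 0 To tbl.Rows.Length - 1
            rowText = vbNullString
            For c = 0 To tbl.Rows.Item(r).Cells.Length - 1
                If c > 0 Then rowText = rowText & vbTab
                rowText = rowText & Trim$(tbl.Rows.Item(r).Cells.Item(c).innerText)
            Next c
            result.Add rowText
        Next r
    Next t
    Set TableRowsAsText = result
End Function

Public Sub DemoTableRows()
    ' Made-up sample standing in for .responseText
    Dim rowText As Variant
    For Each rowText In TableRowsAsText( _
        "<table class=""wikitable sortable""><tr><th>A</th><th>B</th></tr>" & _
        "<tr><td>1</td><td>2</td></tr></table>", "wikitable sortable")
        Debug.Print rowText
    Next rowText
End Sub
```

In practice you would pass .responseText from the request instead of the sample string and write each row to the sheet, for example by splitting it on vbTab.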

Problems While Scraping with VBA - VBA Web Scraping

I'm trying to get a number from a static page on a website, but when I make the HTTP request, the result is strange HTML that lacks the information from the original page that I want.
The page I'm trying to get the information from is:
https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=233361&cvm=true
but I can get the same result with:
https://fnet.bmfbovespa.com.br/fnet/publico/visualizarDocumento?id=233361&cvm=true
The number that I want to get is the number "0,05" on the page.
My code is:
Sub trying()
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=233361&cvm=true&", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
.send
html.body.innerHTML = .responseText
End With
Set element = html.getElementsByTagName("td")(31).innerText
Sheets("Sheet1").Cells(1, 1) = element
End Sub
I have also tried to do this using InternetExplorer.Application, but the problem remains the same.
After trying a bunch of request headers, I found that the Accept request header is required for the response to be returned as HTML:
Sub trying()
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=233361&cvm=true&", False
.setRequestHeader "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
.send
html.body.innerHTML = .responseText
End With
Dim element As String
element = html.getElementsByTagName("td")(32).innerText
Sheets("Sheet1").Cells(1, 1) = element
End Sub
Just add a .htm (or .html) extension to the request to specify file type wanted.
Option Explicit
Public Sub trying()
Dim html As MSHTML.HTMLDocument
Set html = New MSHTML.HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento.htm?id=233361", False
.send
html.body.innerHTML = .responseText
End With
Debug.Print html.querySelector("tr:nth-child(6) .dado-valores").innerText
End Sub
Alternatives that take a little more effort include adding the Accept header or base64-decoding the .responseText before writing it to .innerHTML.
If you go down the less preferable base64 decode route, using the function from here (note: pretty sure this is not the original source), you will need to amend the following line:
.DataType = "bin.base64": .Text = Replace$(b64, Chr$(34), vbNullString) 'modified line
Public Function DecodeBase64(b64$)
Dim b
With CreateObject("Microsoft.XMLDOM").createElement("b64")
.DataType = "bin.base64": .Text = Replace$(b64, Chr$(34), vbNullString) 'modified line
b = .nodeTypedValue
With CreateObject("ADODB.Stream")
.Open: .Type = 1: .Write b: .Position = 0: .Type = 2: .Charset = "utf-8"
DecodeBase64 = .ReadText
.Close
End With
End With
End Function

Concatenate referenced URL into XML HTTP Request

The following snippet of code sends an XML HTTP request to the following site:
Sub GetContents()
Dim XMLReq As New MSXML2.XMLHTTP60
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.028.723", False
XMLReq.send
End Sub
I have another Sub routine, GetURL(), which prints out the desired URL, in this case: https://echa.europa.eu/brief-profile/-/briefprofile/100.028.723
How can I concatenate the output of GetURL() into the BstrUrl argument? i.e.
XMLReq.Open "Get", "x", False where x is the output of GetURL()
Despite various attempts, the syntax is not accepted as a URL.
Assuming you are building on your earlier question, you need to write a function that returns the URL (as Tim Williams has pointed out). I would expand upon this: consider adding tests to ensure that the request succeeded and that there were results, and pass the searchKeyWord as an argument to make your function more reusable. Along the same lines, you could pass the xmlhttp object into the function, so as to avoid repeatedly creating and destroying it.
Avoid auto-instantiation (Dim ... As New), as you can get unexpected results, and avoid Hungarian-style notation. Personally, I also avoid type-declaration characters, as they are harder to read.
vbNullString will offer faster assignment than = "".
I would also use a shorter, faster, and more reliable CSS pattern to retrieve the URL, based on classes and a parent-child relationship of two elements.
Public Sub GetContents()
Dim searchKeyWord As String, xmlReq As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument, url As String
searchKeyWord = "Acetone"
Set xmlReq = New MSXML2.XMLHTTP60
url = GetUrl(searchKeyWord, xmlReq)
Set html = New MSHTML.HTMLDocument
If url <> "N/A" Then
With xmlReq
.Open "GET", url, False
.send
If .Status = 200 Then
html.body.innerHTML = .responseText
Debug.Print html.querySelector("title").innerText
End If
End With
End If
End Sub
Public Function GetUrl(ByVal searchKeyWord As String, ByVal http As MSXML2.XMLHTTP60) As String
Const url = "https://echa.europa.eu/search-for-chemicals?p_auth=5ayUnMyz&p_p_id=disssimplesearch_WAR_disssearchportlet&p_p_lifecycle=1&p_p_state=normal&p_p_col_id=_118_INSTANCE_UFgbrDo05Elj__column-1&p_p_col_count=1&_disssimplesearch_WAR_disssearchportlet_javax.portlet.action=doSearchAction&_disssimplesearch_WAR_disssearchportlet_backURL=https%3A%2F%2Fecha.europa.eu%2Finformation-on-chemicals%3Fp_p_id%3Ddisssimplesearchhomepage_WAR_disssearchportlet%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3D_118_INSTANCE_UFgbrDo05Elj__column-1%26p_p_col_count%3D1%26_disssimplesearchhomepage_WAR_disssearchportlet_sessionCriteriaId%3D"
Dim html As MSHTML.HTMLDocument, dict As Object, i As Long, r As Long
Dim dictKey As Variant, payload$, ws As Worksheet
Set html = New MSHTML.HTMLDocument
Set dict = CreateObject("Scripting.Dictionary")
Set ws = ThisWorkbook.Worksheets("Sheet1")
dict("_disssimplesearchhomepage_WAR_disssearchportlet_formDate") = "1621017052777" 'timestamp
dict("_disssimplesearch_WAR_disssearchportlet_searchOccurred") = "true"
dict("_disssimplesearch_WAR_disssearchportlet_sskeywordKey") = searchKeyWord
dict("_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer") = "true"
dict("_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox") = "on"
payload = vbNullString
For Each dictKey In dict
'Test the payload built so far (not the key) so the first pair is added without a leading "&"
payload = IIf(Len(payload) = 0, WorksheetFunction.EncodeURL(dictKey) & "=" & WorksheetFunction.EncodeURL(dict(dictKey)), _
payload & "&" & WorksheetFunction.EncodeURL(dictKey) & "=" & WorksheetFunction.EncodeURL(dict(dictKey)))
Next dictKey
With http
.Open "POST", url, False
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
.send (payload)
If .Status = 200 Then
html.body.innerHTML = .responseText
Else
GetUrl = "N/A"
Exit Function
End If
End With
Dim result As Boolean
result = html.querySelectorAll(".lfr-search-container .substanceNameLink").Length > 0
GetUrl = IIf(result, html.querySelector(".lfr-search-container .substanceNameLink").href, "N/A")
End Function
If GetURL is a function returning a string then this should work:
Sub GetContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim url
url = GetURL()
XMLReq.Open "Get", url, False
XMLReq.send
End Sub
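For that to work, GetURL has to be a Function that assigns its result to its own name, rather than a Sub that only prints the URL. A minimal sketch, assuming the URL is hard-coded here purely for illustration (in your case it would be whatever your routine currently prints):

```vba
Option Explicit

' Sketch only: a Function returns a value by assigning to its own name.
' The hard-coded URL is a stand-in for however the URL is actually obtained.
Public Function GetURL() As String
    GetURL = "https://echa.europa.eu/brief-profile/-/briefprofile/100.028.723"
End Function

Public Sub DemoGetURL()
    Dim url As String
    url = GetURL()          ' pass the variable, not the literal "x", to .Open
    Debug.Print url
End Sub
```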

Get custom element via screenscraping with VBA

I would like to screen-scrape some prices from Yahoo Finance for some stocks in my Excel sheet.
My approach is to use:
Function Scrape()
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
With appIE
.Navigate "https://de.finance.yahoo.com/quote/TSLA"
.Visible = True
End With
Do While appIE.Busy
DoEvents
Loop
Set data = appIE.document.getElementById("data-reactid") 'this is the point where I'm stuck
End Function
My question is how to get custom elements such as:
<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="32">1.025,05</span>
The site seems to use a reactid for every element, which should make it easy to pinpoint elements. How would I go about doing that for the above example, data-reactid="32"?
Thanks
You can try using XHR, because the content you are looking for is available in the page source. This is one efficient way to go about it:
Public Sub GetPrice()
Const Url$ = "https://de.finance.yahoo.com/quote/TSLA"
Dim S$, itemPrice$
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", Url, False
.send
S = .responseText
End With
With CreateObject("HTMLFile")
.write S
itemPrice = .getElementById("quote-market-notice").ParentNode.FirstChild.innerText
MsgBox itemPrice
End With
End Sub
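To target an element by its data-reactid directly, note that data-reactid is an attribute rather than an id, so getElementById will not find it; a CSS attribute selector is one option. The sketch below runs against a made-up HTML fragment modelled on the span in the question; the reactid values on the live page may not be stable between loads, so treat the value 32 as an assumption.

```vba
Option Explicit

' Sketch: select an element by a custom attribute with a CSS attribute selector.
' Requires a reference to Microsoft HTML Object Library.
Public Function TextByReactId(ByVal bodyHtml As String, ByVal reactId As String) As String
    Dim html As MSHTML.HTMLDocument, node As Object
    Set html = New MSHTML.HTMLDocument
    html.body.innerHTML = bodyHtml
    Set node = html.querySelector("span[data-reactid='" & reactId & "']")
    ' querySelector returns Nothing when no element matches
    If Not node Is Nothing Then TextByReactId = node.innerText
End Function

Public Sub DemoTextByReactId()
    ' Made-up fragment standing in for the downloaded page source
    Debug.Print TextByReactId("<div><span class=""price"" data-reactid=""32"">1.025,05</span></div>", "32")
End Sub
```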

VBA : how to connect MSXML2.XMLHTTP60 response to IHTMLDocument(iframe)

I am trying to use MSXML2 and IHTMLDocument
to deal with the iframe part of an HTML web page.
I want to use MSXML2 to fetch and save the page so that I can better "capture" the data, thinking it's faster than using InternetExplorer or the Selenium reference available from the VBA menu.
(I want to avoid using IE or Selenium as much as possible.)
But I couldn't find out how to save the document in XML format (to take advantage of its speed) and at the same time click on an element in the document without the help of a browser (IE or Selenium).
And even after clicking a tab (id="cns_Tab21") on this web page, I have difficulty retrieving the data.
So my questions are:
1> Is it possible to minimize the use of a browser for clicking?
2> Even after clicking (using Selenium), it throws an XPath-related error in the VBA editor.
Thank you in advance for your answer. The URL used for this is
http://bitly.kr/finance
and the iframe inside the link is http://bitly.kr/LT0aCb
'I declared objects
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim iframeDoc As IHTMLDocument
'and saved XML data to HTML format
HTMLDoc.body.innerHTML = XMLReq.responseText
'and trying to save this HTML to iframe...
Set iframeDoc = HTMLDoc.getElementById("coinfo_cp")
'I tried .contentDocument, but maybe HTMLDoc doesn't have this property,
'and I don't know how to access the information I saved to iframeDoc above.
'And after I use Selenium, I can't figure out why it throws an error
For Each ele In selenium.FindElementsByTag("th")
If ele.Attribute("innerText") = "CAPEX" Then
Debug.Print ele.FindElementsByXPath("./../td").Attribute("innerText")
This post isn't a duplicate, since I am trying to use XML to handle the iframe element without the InternetExplorer reference (ie.document) in Excel VBA.
You can replicate the XHR request the page makes when that tab (not an iframe) is selected. I use the clipboard to copy the table to Excel. Note: the URL I am using is from our discussions; this info should be reflected in the question.
Option Explicit
Public Sub GetTable()
'VBE > Tools > References > Microsoft HTML Object Library
Dim html As HTMLDocument, hTable As HTMLTable, clipboard As Object
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://navercomp.wisereport.co.kr/v2/company/ajax/cF1001.aspx?cmp_cd=005930&fin_typ=0&freq_typ=Y&encparam=ZXR1cWFjeGJnS1lWOHhCYmNScmJXUT09&id=bG05RlB6cn", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
End With
Set hTable = html.querySelector(".hbG05RlB6cn + .gHead01")
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}") ' New DataObject
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).PasteSpecial
End Sub
You can find the params of the ajax URL for the tab content update, along with the target for the update, in the scripts of the page.
This needs tidying up:
Option Explicit
Public Sub GetTable()
'https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930
'VBE > Tools > References > Microsoft HTML Object Library
Dim html As HTMLDocument, hTable As HTMLTable, clipboard As Object, ws As Worksheet
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://navercomp.wisereport.co.kr/v2/company/ajax/cF1001.aspx?cmp_cd=005930&fin_typ=0&freq_typ=Y&encparam=ZXR1cWFjeGJnS1lWOHhCYmNScmJXUT09&id=bG05RlB6cn", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
End With
Set hTable = html.querySelector(".hbG05RlB6cn + .gHead01") '2nd tab. CAPEX row
Dim html2 As HTMLDocument, i As Long
Set html2 = New HTMLDocument
html2.body.innerHTML = hTable.outerHTML
Dim tableBodyRows As Object, tableBodyRowLength As Long, tableHeaderRowLength As Long, tableHeaderRows As Object, targetRow As Long
Set tableBodyRows = html2.querySelectorAll("tbody tr .bg")
tableBodyRowLength = tableBodyRows.Length
tableHeaderRowLength = html2.querySelectorAll("thead tr").Length + 2
For i = 0 To tableBodyRowLength - 1
If Trim$(tableBodyRows.item(i).innerText) = "CAPEX" Then
targetRow = i + tableHeaderRowLength + 1
Exit For
End If
Next
Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}") ' New DataObject
clipboard.SetText hTable.outerHTML
clipboard.PutInClipboard
ws.Cells(1, 1).PasteSpecial
Dim unionRng As Range
For i = (tableHeaderRowLength + 1) To (tableBodyRowLength + tableHeaderRowLength)
If i <> targetRow Then
If Not unionRng Is Nothing Then
Set unionRng = Union(ws.rows(i), unionRng)
Else
Set unionRng = ws.rows(i)
End If
End If
Next
If Not unionRng Is Nothing Then unionRng.Delete
End Sub

Resources