Scrape with xmlhttp - excel

I would like to get data from https://www.goaloong.net/football/6in1
This page contains a table.
I tried with:
Sub REQUESTXML()
    Dim XMLHttpRequest As xmlHttp
    Dim HTMLDoc As New HTMLDocument
    Dim elem As Object
    Dim x As Long
    Set XMLHttpRequest = New MSXML2.xmlHttp
    XMLHttpRequest.Open "GET", "https://www.goaloong.net/football/6in1", False
    XMLHttpRequest.send
    While XMLHttpRequest.readyState = 200
        DoEvents
    Wend
    Debug.Print XMLHttpRequest.responseText
    HTMLDoc.Body.innerHTML = XMLHttpRequest.responseText
    x = 1
    For Each elem In HTMLDoc.getElementsByClassName("Leaguestitle")
        Sheets("req").Range("A" & x).Value = HTMLDoc.getElementsByTagName("a")(0).innerText
        x = x + 1
    Next elem
End Sub
I get no results. Could you kindly help me?

The page https://www.goaloong.net/football/6in1 is dynamic, i.e. the JavaScript is loaded first, and the scripts then load the content. One approach is to load the full page in InternetExplorer and extract the content from there. Example below (tested):
Sub REQUESTXML()
    'Requires references to Microsoft Internet Controls and Microsoft HTML Object Library
    Dim IE As New InternetExplorer
    Dim elem As Object
    Dim x As Long
    IE.navigate "https://www.goaloong.net/football/6in1"
    Do While IE.readyState = READYSTATE_COMPLETE: DoEvents: Loop   'wait while still COMPLETE from the previous document
    Do Until IE.readyState = READYSTATE_COMPLETE: DoEvents: Loop   'wait until the new page reports COMPLETE
    'for debug purposes: dump the rendered HTML to a file
    Open ThisWorkbook.Path & "\TESTFILE.html" For Output As #1
    Print #1, IE.document.body.innerHTML
    Close #1
    x = 1
    For Each elem In IE.document.getElementsByClassName("Leaguestitle")
        Sheets(1).Range("A" & x).Value = elem.innerText
        x = x + 1
    Next elem
    IE.Quit
End Sub

If you're OK with using a DLL and rewriting your code, you can drive Microsoft's Edge browser (a Chromium-based browser) from VBA. With that you can do almost anything you want. Note, however, that access to the DOM is performed by JavaScript, not by an object like
Dim IE As New InternetExplorer. Look at the VBA sample and you'll get the idea.
https://github.com/peakpeak-github/libEdge
Sidenote: Samples for C# and C++ are also included.

Related

Webscraping of product prices and specs

I'm new to the VBA world and I was trying to create an Excel file to get the cheapest price and the name tag on a website. I've created one file that goes through the whole search list, gathers each product URL and places it in a spreadsheet. My challenge now is making this second piece of code work. It starts out getting the information without any issues, but after 10 or more URLs the macro hits a bug and starts repeating the same information in all the following cells. Is there a way I can make the code run slower so it doesn't get into this problem?
I'll list my code and a sample of the URLs I'm scraping below.
Here's the code I've been using:
Sub test()
    Dim URL As String
    Set ie = CreateObject("internetexplorer.application")
    For i = 2 To 300
        URL = Cells(i, 1).Value
        ie.navigate URL
        ie.Visible = False
        Do While ie.busy And ie.readystate <> "readystate_complete"
            DoEvents
        Loop
        Cells(i, 3) = ie.document.getElementsByTagName("h1")(0).innerText
        Cells(i, 4) = ie.document.getElementsByTagName("strong")(0).innerText
    Next i
    ie.Quit
    MsgBox "acabou"
End Sub
<https://www.zoom.com.br/notebook/notebook-vaio-vjf157f11x-b0211s-intel-core-i5-8250u-15-6-8gb-ssd-256-gb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-samsung-chromebook-plus-intel-celeron-3965y-12-2-4gb-emmc-32-gb-chrome-os-touchscreen>
<https://www.zoom.com.br/notebook/notebook-dell-xps-7390-intel-core-i7-10710u-13-3-16gb-ssd-512-gb-windows-10-touchscreen>
<https://www.zoom.com.br/notebook/notebook-dell-i15-3583-a5-intel-core-i7-8565u-15-6-8gb-hd-2-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i5-8250u-15-6-4gb-hd-1-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-dell-i15-7580-a20-intel-core-i7-8550u-15-6-8gb-hd-1-tb-geforce-mx150-windows-10>
<https://www.zoom.com.br/notebook/notebook-dell-i14-3480-u30-intel-core-i5-8265u-14-4gb-hd-1-tb-linux-8-geracao>
<https://www.zoom.com.br/notebook/macbook-pro-apple-muhn2bz-intel-core-i5-13-3-8gb-ssd-128-gb-tela-de-retina>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc150-amd-a4-9120-14-2gb-emmc-32-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-samsung-np930qaa-kw1br-intel-core-i7-8550u-13-3-8gb-ssd-256-gb-windows-10-touchscreen>
<https://www.zoom.com.br/notebook/notebook-acer-a515-51g-58vh-intel-core-i5-7200u-15-6-8gb-hd-1-tb-geforce-940mx>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc222-intel-celeron-dual-core-13-3-4gb-emmc-64-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-pt515-51-788a-intel-core-i7-9750h-15-6-32gb-ssd-1-tb-geforce-rtx-2070-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-a315-53-53ak-intel-core-i5-7200u-15-6-4gb-hd-1-tb-windows-10-7-geracao>
<https://www.zoom.com.br/notebook/notebook-dell-i15-5584-m40-intel-core-i7-8565u-15-6-8gb-hd-2-tb-geforce-mx130-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-a315-41g-r21b-amd-ryzen-5-2500u-15-6-8gb-hd-1-tb-radeon-535-windows-10>
<https://www.zoom.com.br/notebook/notebook-positivo-master-n2140-intel-core-i3-7020u-14-4gb-hd-500-gb-windows-10-7-geracao>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc101-intel-atom-14-1gb-ssd-32-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i5-8250u-15-6-8gb-hd-1-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-acer-an515-51-77fh-intel-core-i7-7700hq-15-6-8gb-hd-1-tb-geforce-gtx-1050-windows-10>
<https://www.zoom.com.br/notebook/notebook-dell-i15-3583-a2yp-intel-core-i5-8265u-15-6-4gb-optane-16-gb-hd-1-tb-windows-10>
<https://www.zoom.com.br/notebook/notebook-asus-g531gt-intel-core-i7-9750h-15-6-16gb-ssd-512-gb-geforce-gtx-1650-windows-10>
<https://www.zoom.com.br/notebook/notebook-vaio-fit-15s-intel-core-i3-7100u-15-6-4gb-hd-1-tb-windows-10-home>
<https://www.zoom.com.br/notebook/notebook-samsung-s50-intel-core-i7-7500u-13-3-8gb-ssd-256-gb-windows-10-style>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i3-7020u-15-6-4gb-ssd-120-gb-windows-10-7-geracao>
First of all:
Always declare all variables. To force this, always write Option Explicit as the first line in each module. That way, typos in variable names are detected immediately by the compiler.
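A minimal illustration of the point (the procedure and variable names here are purely for demonstration):
Option Explicit   'must be the very first line of the module

Sub Demo()
    Dim rowCount As Long
    rowCount = 5
    'Misspelling the name (e.g. rowCoutn = 5) now raises "Variable not defined"
    'at compile time instead of silently creating a new, empty variable.
End Sub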
Here is what to do about your problem:
IE is sometimes a real diva. For example, it doesn't like having to process URLs in quick succession in the same instance. It is therefore advisable to kick it out of memory and restart it for each new URL.
For it to restart quickly, the options to delete cookies, the cache, etc. must not be enabled in its settings; otherwise automation errors will occur.
Try this macro. It works with the given URLs:
Option Explicit

Sub test()
    Dim URL As String
    Dim ie As Object
    Dim i As Long
    For i = 2 To 300
        If i > 14 Then
            ActiveWindow.SmallScroll down:=1
        End If
        URL = ActiveSheet.Cells(i, 1).Value
        Set ie = CreateObject("internetexplorer.application")
        ie.navigate URL
        ie.Visible = False
        Do While ie.readystate <> 4: DoEvents: Loop
        ActiveSheet.Cells(i, 3) = ie.document.getElementsByTagName("h1")(0).innerText
        ActiveSheet.Cells(i, 4) = ie.document.getElementsByTagName("strong")(0).innerText
        ie.Quit
        Set ie = Nothing
    Next i
    MsgBox "acabou"
End Sub
Try this
Sub GetPrices()
    Dim html As MSHTML.HTMLDocument, r As Long
    For r = 1 To 4
        Set html = GetHTML(Cells(r, 1).Value)
        Cells(r, 3).Value = html.querySelector("h1.product-name").innerText
        Cells(r, 4).Value = Replace(Replace(html.querySelector(".product-price").innerText, "a partir de ", vbNullString), ":( ", "")
        Set html = Nothing
    Next r
End Sub

Function GetHTML(ByVal sURL As String) As HTMLDocument
    Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
    Set http = New MSXML2.XMLHTTP60
    Set html = New MSHTML.HTMLDocument
    With http
        .Open "Get", sURL, False
        .send
        html.body.innerHTML = .responseText
    End With
    Set GetHTML = html
End Function
In Python I would do this:
First, import the library:
from time import sleep
Two options are available.
This pauses for 3 seconds before every request:
sleep(3)
This pauses for a random interval of 1 to 3 seconds between requests (it also needs from random import randint):
sleep(randint(1, 3))
Note, be aware of:
cookies, because sometimes you need to send a POST request to scrape some id before the GET;
the syntax of your headers; set the Origin and Referer parameters correctly.
A rough VBA equivalent of both ideas is sketched below.
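A sketch using MSXML2.XMLHTTP with a randomized delay and explicit headers; the URL and header values are placeholders to adapt to your site:
Sub ThrottledRequest()
    Dim http As Object, waitSecs As Long
    Set http = CreateObject("MSXML2.XMLHTTP")
    Randomize
    waitSecs = Int(Rnd * 3) + 1                                'random 1-3 second pause, like sleep(randint(1, 3))
    Application.Wait Now + TimeSerial(0, 0, waitSecs)
    With http
        .Open "GET", "https://www.example.com", False          'placeholder URL
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .setRequestHeader "Referer", "https://www.example.com" 'set Referer (and Origin, if the site requires it)
        .send
        Debug.Print .Status, Len(.responseText)
    End With
End Sub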

VBA HTML elements to Excel

I am working on code that uses VBA in Excel to navigate to a website and copy some values to Excel.
I can open the website and navigate, but I can't save the "Precipitation" values to the Excel sheet.
Sub accuweather()
    Dim ie As InternetExplorer
    Dim pagePiece As Object
    Dim webpage As HTMLDocument
    Set ie = New InternetExplorer
    ie.Visible = True
    ie.navigate ("http://www.accuweather.com/en/pt/abadia/869773/daily-weather-forecast/869773?day=2")
    Do While ie.readyState = 4: DoEvents: Loop
    Do Until ie.readyState = 4: DoEvents: Loop
    While ie.Busy
        DoEvents
    Wend
    Set webpage = ie.document
    Set mtbl = webpage.getElementsByTagName("details-card card panel details allow-wrap")
    Set table_data = mtbl.getElementsByTagName("div")(1)
    For itemNum = 1 To 240
        For childNum = 0 To 5
            Cells(itemNum, childNum + 1) = table_data.Item(itemNum).Children(childNum).innerText
        Next childNum
    Next itemNum
    ie.Quit
    Set ie = Nothing
End Sub
The method you are using is getElementsByTagName, but the value you pass is actually a multi-valued class, so the matching method would be getElementsByClassName.
However, you don't need the browser at all: that content is static, so you can use a faster xmlhttp request and target a single (more robust and faster) class.
This
html.querySelectorAll(".list")
is retrieving the two parent nodes which have the various p tag children. The first child in both cases
.Item(i).FirstChild
is the precipitation info.
Option Explicit

Public Sub GetPrecipitationValues()
    Dim html As MSHTML.HTMLDocument, i As Long
    Set html = New MSHTML.HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.accuweather.com/en/pt/abadia/869773/daily-weather-forecast/869773?day=2", False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        html.body.innerHTML = .responseText
    End With
    With html.querySelectorAll(".list")
        For i = 0 To .Length - 1
            Debug.Print .Item(i).FirstChild.innerText
        Next
    End With
End Sub

When the search button is clicked using vba the text entered in search box is not seen by web page

I have written VBA code that enters a manufacturer part number in the search box of the website below and clicks the search icon. It is able to enter the manufacturer part number in the search box and click the search icon, but when the search icon is clicked, the text entered in the text box is not picked up; it searches with empty data.
'HTML Part for search icon
<em class="fa fa-search" aria-hidden="true" style="color: gray;"></em>
It has been almost a month and I have tried various approaches, including some mentioned on Stack Overflow such as createEvent("keyboardevent"), but nothing worked.
' VBA code
Sub AptivScrapping()
    Dim IE As SHDocVw.InternetExplorer
    Set IE = New InternetExplorer
    IE.Visible = True
    IE.navigate "https://ecat.aptiv.com"
    Do While IE.readyState < READYSTATE_COMPLETE
    Loop
    Dim idoc As MSHTML.HTMLDocument
    Set idoc = IE.document
    idoc.getElementById("searchUserInput").Value = "33188785"
    Dim doc_ele As MSHTML.IHTMLElement
    Dim doc_eles As MSHTML.IHTMLElementCollection
    Set doc_eles = idoc.getElementsByTagName("a")
    For Each doc_ele In doc_eles
        If doc_ele.getAttribute("ng-click") = "SearchButtonClick(1)" Then
            doc_ele.Click
            Exit Sub
        End If
    Next doc_ele
End Sub
The page does an xhr request to retrieve the search results. You can find it in the network tab after clicking submit. This means you can avoid, in this case, the expense of a browser and issue an xhr request. The response is json so you do need a json parser to handle the results.
I would use jsonconverter.bas to parse the JSON. After adding the code from that link to a standard module called JsonConverter, go to VBE > Tools > References and add a reference to Microsoft Scripting Runtime.
I dimension an array to hold the results. I determine the number of rows from the number of items in the JSON collection returned, and the number of columns from the size of the first item's dictionary. I loop over the JSON object, inner-loop over the dictionary keys of each dictionary in the collection, and populate the array. I write the array out in one go at the end, which is less I/O expensive.
Option Explicit

Public Sub GetInfo()
    Dim json As Object, ws As Worksheet, headers()
    Dim item As Object, key As Variant, results(), r As Long, c As Long
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://ecat.aptiv.com/json/eCatalogSearch/SearchProducts?filter=All&options=&pageSize=10&search=33188785", False
        .send
        Set json = JsonConverter.ParseJson(.responseText)("Products")
    End With
    headers = json.item(1).keys
    ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
    For Each item In json
        r = r + 1: c = 1
        For Each key In item.keys
            results(r, c) = item(key)
            c = c + 1
        Next
    Next
    With ws
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub
You can do this instead:
txt = "33188785"
IE.navigate "https://ecat.aptiv.com/feature?search=" & txt
This will take you straight to the Search Result.
Code:
Sub AptivScrapping()
    Dim IE As SHDocVw.InternetExplorer
    Dim txt As String
    Set IE = New InternetExplorer
    txt = "33188785"
    IE.Visible = True
    IE.navigate "https://ecat.aptiv.com/feature?search=" & txt
    Do While IE.Busy
        Application.Wait DateAdd("s", 1, Now)
    Loop
End Sub
This will be faster, as you will only have to load one page.
Why that happens I am not sure, but it seems the text box used for input is not activated when text is added to it automatically; it is activated when we click inside it.
I got the solution to the above problem from MrExcel.com; below is the link to that post.
https://www.mrexcel.com/forum/excel-questions/1105434-vba-ie-automation-issue-angularjs-input-text-post5317832.html#post5317832
In this case I need to enter the search string character by character, with SendKeys and input events inside the loop. Below is the working VBA code.
Sub AptivScrapping()
    Dim IE As SHDocVw.InternetExplorer
    Dim sFieldInput As String, s As Long, LoopSeconds As Long
    LoopSeconds = 1 'pause, in seconds, between keystrokes
    Set IE = New InternetExplorer
    IE.Visible = True
    IE.navigate "https://ecat.aptiv.com"
    Do While IE.readyState < READYSTATE_COMPLETE
    Loop
    Dim idoc As MSHTML.HTMLDocument
    Set idoc = IE.document
    IE.document.getElementById("searchUserInput").Focus = True
    IE.document.getElementById("searchUserInput").Select
    sFieldInput = "33188785"
    For s = 1 To Len(sFieldInput)
        Application.SendKeys Mid(sFieldInput, s, 1)
        While IE.readyState < 4 Or IE.Busy
            Application.Wait DateAdd("s", LoopSeconds, Now)
        Wend
    Next s
    IE.document.getElementById("searchUserInput").Focus = False
    Dim doc_ele As MSHTML.IHTMLElement
    Dim doc_eles As MSHTML.IHTMLElementCollection
    Set doc_eles = idoc.getElementsByTagName("a")
    For Each doc_ele In doc_eles
        If doc_ele.getAttribute("ng-click") = "SearchButtonClick(1)" Then
            doc_ele.Click
            Exit Sub
        End If
    Next doc_ele
End Sub

How to click a button in Internet Explorer using VBA

I saw some examples that explain how to click a button in Internet Explorer using VBA. However, it is not working for the site I need to use.
The button does not have an "id". I saw the querySelector function, but it did not work either.
Site: http://www2.bmf.com.br/pages/portal/bmfbovespa/boletim1/TxRef1.asp
Sub Download()
    Dim user, password As Variant
    Set IE = CreateObject("InternetExplorer.Application")
    IE.navigate "http://www2.bmf.com.br/pages/portal/bmfbovespa/boletim1/TxRef1.asp"
    IE.Visible = True
    While IE.Busy
        DoEvents
    Wend
    Application.Wait (Now + TimeValue("00:00:02"))
    'Preencher o Login e Senha (fill in the login and password)
    IE.Document.querySelector("img[src='images/toolbar/b_edit.gif']").Click
End Sub
Your selector is wrong.
The HTML is:
<img style="CURSOR:HAND" src="http://www.bmf.com.br/bmfbovespa/images/comum/btoExcel.gif" align="absmiddle" hspace="0" onclick="salvaxls()">
You can use the following attribute = value selector:
[onclick='salvaxls()']
You could also use the $ (ends-with) operator and target the src:
[src$='btoExcel.gif']
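Either selector can be passed to querySelector on the loaded document; a minimal sketch, assuming ie.document already holds the page:
'Either line clicks the export image; use one or the other
ie.document.querySelector("[onclick='salvaxls()']").Click
'ie.document.querySelector("[src$='btoExcel.gif']").Click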
Using a proper page-load wait, you have the following:
Option Explicit
'VBE > Tools > References:
' Microsoft Internet Controls

Public Sub RetrieveInfo()
    Dim ie As InternetExplorer
    Set ie = New InternetExplorer
    With ie
        .Visible = True
        .Navigate2 "http://www2.bmf.com.br/pages/portal/bmfbovespa/boletim1/TxRef1.asp"
        While .Busy Or .readyState < 4: DoEvents: Wend
        .document.querySelector("[src$='btoExcel.gif']").Click
        Stop
    End With
End Sub
There are lots of existing answers on SO regarding how to interact with the Save/Open dialog. Personally, I prefer to automate with SeleniumBasic and use Chrome to avoid this issue altogether.
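For reference, a minimal SeleniumBasic sketch of that alternative (it assumes the SeleniumBasic add-in is installed, its Selenium Type Library is referenced, and a matching Chrome driver is present; the selector is the same one used above):
Public Sub DownloadWithChrome()
    Dim d As New ChromeDriver
    d.Get "http://www2.bmf.com.br/pages/portal/bmfbovespa/boletim1/TxRef1.asp"
    'Chrome downloads the file directly, so there is no IE Save/Open dialog to automate
    d.FindElementByCss("[src$='btoExcel.gif']").Click
    d.Quit
End Sub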
I am on a personal campaign to encourage people to use HTTP requests when it's possible, so here's my two cents:
Sub Taxas()
    Dim req As New WinHttpRequest
    Dim doc As New HTMLDocument
    Dim table As HTMLTable
    Dim tableRow As HTMLTableRow
    Dim reqURL As String
    Dim mainURL As String
    Dim dateOfInterest As Date
    Dim param1 As String
    Dim param2 As String
    Dim param3 As String
    Dim i As Long
    dateOfInterest = Date - 1 '11/04/2019 - use whichever date you want
    param1 = Format(dateOfInterest, "dd/mm/yyyy")
    param2 = Format(dateOfInterest, "yyyymmdd")
    param3 = "PRE" 'this can be changed according to which element from the drop-down list on the top left you need
    mainURL = "http://www2.bmf.com.br/pages/portal/bmfbovespa/boletim1/TxRef1.asp"
    reqURL = mainURL & "?Data=" & param1 & "&Data1=" & param2 & "&slcTaxa=" & param3
    With req
        .Open "POST", reqURL, False
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .send
        doc.body.innerHTML = .responseText
    End With
    Set table = doc.getElementById("tb_principal1")
    i = 1
    For Each tableRow In table.Rows
        If tableRow.Cells(0).className <> "tabelaTitulo" And tableRow.Cells(0).className <> "tabelaItem" Then
            ThisWorkbook.Worksheets(1).Cells(i, "A") = CDbl(Replace((tableRow.Cells(0).innerText), ",", "."))
            ThisWorkbook.Worksheets(1).Cells(i, "B") = CDbl(Replace((tableRow.Cells(1).innerText), ",", "."))
            ThisWorkbook.Worksheets(1).Cells(i, "C") = CDbl(Replace((tableRow.Cells(2).innerText), ",", "."))
            i = i + 1
        End If
    Next tableRow
End Sub
Make sure you go to the VB editor > Tools > References and add Microsoft WinHTTP Services, version 5.1 and Microsoft HTML Object Library.
With this method you don't need to download an Excel file; you get the data straight from the source and write it to your worksheet.
Study the code and try to learn from it; I promise it will make your life easier in any future web-scraping projects.
Cheers

Unable to get rid of blank screen popping up while scraping content

I've created a macro to parse the names of different movies, traversing multiple pages of a torrent site. I used InternetExplorer in combination with an Html.body.innerHTML parser (usually used with xmlhttp requests) to make the execution faster. Although the content of this site is not dynamic, I used IE to see how it behaves when scraping in combination with Html.body.innerHTML.
When I run my script using IE, it parses the content as expected and finishes the job. The only problem I'm facing is that a blank screen pops up out of nowhere on every pagination, as in the image below.
How can I get rid of this blank-screen issue?
My script (no issues when I use Html.body.innerHTML in combination with ChromeDriver):
Sub FetchContent()
    Const link = "https://yts.am/browse-movies/0/all/action/0/latest?page="
    Dim driver As New ChromeDriver, Html As New HTMLDocument, genre$
    Dim post As HTMLDivElement, I&, R&
    For I = 1 To 3
        With driver
            .AddArgument "--headless"
            .get link & I
            Html.body.innerHTML = .ExecuteScript("return document.documentElement.outerHTML")
        End With
        For Each post In Html.getElementsByClassName("browse-movie-bottom")
            R = R + 1: Sheets(1).Cells(R, 1) = post.getElementsByClassName("browse-movie-title")(0).innerText
        Next post
    Next I
    driver.Quit
End Sub
I would like to get the same behavior when I use Html.body.innerHTML in combination with IE, but it produces blank screens along with the desired result:
Sub FetchContent()
    Const link = "https://yts.am/browse-movies/0/all/action/0/latest?page="
    Dim IE As New InternetExplorer, Html As New HTMLDocument, genre$
    Dim post As HTMLDivElement, I&, R&
    For I = 1 To 4
        With IE
            .Visible = False
            .navigate link & I
            While .Busy Or .readyState < 4: DoEvents: Wend
            Html.body.innerHTML = .document.DocumentElement.outerHTML
        End With
        For Each post In Html.getElementsByClassName("browse-movie-bottom")
            R = R + 1: Sheets(1).Cells(R, 1) = post.getElementsByClassName("browse-movie-title")(0).innerText
        Next post
    Next I
    IE.Quit
End Sub
(Image: the blank screen that appears on every pagination when IE is combined with Html.body.innerHTML.)
I tried to test your code with IE and found that I was able to reproduce the issue.
I found that the line below is causing it:
Html.body.innerHTML = .document.DocumentElement.outerHTML
Below is a modified version of your code in which the issue is resolved.
Sub FetchContent1()
    Const link = "https://yts.am/browse-movies/0/all/action/0/latest?page="
    Dim IE As New InternetExplorer, Html As New HTMLDocument, genre$
    Dim post As HTMLDivElement, I&, R&
    For I = 1 To 4
        With IE
            .Visible = False
            .navigate link & I
            While .Busy Or .readyState < 4: DoEvents: Wend
            For Each post In .document.getElementsByClassName("browse-movie-bottom")
                R = R + 1: Sheets(1).Cells(R, 1) = post.getElementsByClassName("browse-movie-title")(0).innerText
            Next post
        End With
    Next I
    IE.Quit
End Sub
Output in IE 11:
Now it will not open any blank page in IE.
