Webscraping of product prices and specs - excel

I'm new to the VBA world and I was trying to create an Excel file to get the cheapest price and name tag from a website. I've created one file that goes through the whole search list, gathers each product URL and places it on a spreadsheet. My challenge now is making this second macro work. It starts out getting the information without any issues, but after 10 or more URLs the macro hits a bug and starts repeating the same information in all the following cells. Is there a way I can make the code run slower so it doesn't run into this problem?
I'll list my code here, along with a sample of the URLs I'm scraping.
Here's the code I've been using:
Sub test()
Dim URL As String
Set ie = CreateObject("internetexplorer.application")
For i = 2 To 300
URL = Cells(i, 1).Value
ie.navigate URL
ie.Visible = False
Do While ie.busy And ie.readystate <> "readystate_complete"
DoEvents
Loop
Cells(i, 3) = ie.document.getElementsByTagName("h1")(0).innerText
Cells(i, 4) = ie.document.getElementsByTagName("strong")(0).innerText
Next i
ie.Quit
MsgBox "acabou"
End Sub
<https://www.zoom.com.br/notebook/notebook-vaio-vjf157f11x-b0211s-intel-core-i5-8250u-15-6-8gb-ssd-256-gb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-samsung-chromebook-plus-intel-celeron-3965y-12-2-4gb-emmc-32-gb-chrome-os-touchscreen>
<https://www.zoom.com.br/notebook/notebook-dell-xps-7390-intel-core-i7-10710u-13-3-16gb-ssd-512-gb-windows-10-touchscreen>
<https://www.zoom.com.br/notebook/notebook-dell-i15-3583-a5-intel-core-i7-8565u-15-6-8gb-hd-2-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i5-8250u-15-6-4gb-hd-1-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-dell-i15-7580-a20-intel-core-i7-8550u-15-6-8gb-hd-1-tb-geforce-mx150-windows-10>
<https://www.zoom.com.br/notebook/notebook-dell-i14-3480-u30-intel-core-i5-8265u-14-4gb-hd-1-tb-linux-8-geracao>
<https://www.zoom.com.br/notebook/macbook-pro-apple-muhn2bz-intel-core-i5-13-3-8gb-ssd-128-gb-tela-de-retina>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc150-amd-a4-9120-14-2gb-emmc-32-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-samsung-np930qaa-kw1br-intel-core-i7-8550u-13-3-8gb-ssd-256-gb-windows-10-touchscreen>
<https://www.zoom.com.br/notebook/notebook-acer-a515-51g-58vh-intel-core-i5-7200u-15-6-8gb-hd-1-tb-geforce-940mx>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc222-intel-celeron-dual-core-13-3-4gb-emmc-64-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-pt515-51-788a-intel-core-i7-9750h-15-6-32gb-ssd-1-tb-geforce-rtx-2070-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-a315-53-53ak-intel-core-i5-7200u-15-6-4gb-hd-1-tb-windows-10-7-geracao>
<https://www.zoom.com.br/notebook/notebook-dell-i15-5584-m40-intel-core-i7-8565u-15-6-8gb-hd-2-tb-geforce-mx130-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-a315-41g-r21b-amd-ryzen-5-2500u-15-6-8gb-hd-1-tb-radeon-535-windows-10>
<https://www.zoom.com.br/notebook/notebook-positivo-master-n2140-intel-core-i3-7020u-14-4gb-hd-500-gb-windows-10-7-geracao>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc101-intel-atom-14-1gb-ssd-32-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i5-8250u-15-6-8gb-hd-1-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-acer-an515-51-77fh-intel-core-i7-7700hq-15-6-8gb-hd-1-tb-geforce-gtx-1050-windows-10>
<https://www.zoom.com.br/notebook/notebook-dell-i15-3583-a2yp-intel-core-i5-8265u-15-6-4gb-optane-16-gb-hd-1-tb-windows-10>
<https://www.zoom.com.br/notebook/notebook-asus-g531gt-intel-core-i7-9750h-15-6-16gb-ssd-512-gb-geforce-gtx-1650-windows-10>
<https://www.zoom.com.br/notebook/notebook-vaio-fit-15s-intel-core-i3-7100u-15-6-4gb-hd-1-tb-windows-10-home>
<https://www.zoom.com.br/notebook/notebook-samsung-s50-intel-core-i7-7500u-13-3-8gb-ssd-256-gb-windows-10-style>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i3-7020u-15-6-4gb-ssd-120-gb-windows-10-7-geracao>

First of all:
Always declare all variables. To force this, always write Option Explicit as the first line in each module. This way, especially typos in variable names are immediately detected by the compiler.
Here is what to do about your problem:
IE is sometimes a real diva. For example, it doesn't like to have to process URLs in the same instance in quick succession. Therefore it is advisable to kick it out of memory and restart it for each new URL.
For IE to restart quickly, it must not be configured to delete cookies, the cache, etc. on exit. Otherwise automation errors will occur.
Try this macro. With the given URLs it works:
Option Explicit
Sub test()
Dim URL As String
Dim ie As Object
Dim i As Long
For i = 2 To 300
If i > 14 Then
ActiveWindow.SmallScroll down:=1
End If
URL = ActiveSheet.Cells(i, 1).Value
Set ie = CreateObject("internetexplorer.application")
ie.navigate URL
ie.Visible = False
Do While ie.readystate <> 4: DoEvents: Loop
ActiveSheet.Cells(i, 3) = ie.document.getElementsByTagName("h1")(0).innerText
ActiveSheet.Cells(i, 4) = ie.document.getElementsByTagName("strong")(0).innerText
ie.Quit
Set ie = Nothing
Next i
MsgBox "acabou"
End Sub

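Try this. The early-bound types used below need references to Microsoft HTML Object Library and Microsoft XML, v6.0 (VBE > Tools > References):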
Sub GetPrices()
Dim html As MSHTML.HTMLDocument, r As Long
For r = 1 To 4
Set html = GetHTML(Cells(r, 1).Value)
Cells(r, 3).Value = html.querySelector("h1.product-name").innerText
Cells(r, 4).Value = Replace(Replace(html.querySelector(".product-price").innerText, "a partir de ", vbNullString), ":( ", "")
Set html = Nothing
Next r
End Sub
Function GetHTML(ByVal sURL As String) As HTMLDocument
Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
Set http = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With http
.Open "Get", sURL, False
.send
html.body.innerHTML = .responseText
End With
Set GetHTML = html
End Function

In Python I would do the following.
First import what you need:
from time import sleep
from random import randint
Two options are available:
Sleep for a fixed 3 seconds before each request:
sleep(3)
Sleep for a random 1 to 3 seconds between requests:
sleep(randint(1, 3))
Note: also watch out for:
cookies, because sometimes you need to send a POST request to obtain an ID before the GET.
the syntax of your headers; set the Origin and Referer parameters correctly.
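The same throttling idea can be sketched in VBA, the language of the question. This is only a minimal sketch: PauseSeconds and DemoThrottledLoop are hypothetical names, and the loop bounds and delays are placeholders to adapt.

```vba
Option Explicit

' Pause for roughly the given number of whole seconds while keeping Excel responsive.
' PauseSeconds is a hypothetical helper name, not part of the original code.
Public Sub PauseSeconds(ByVal seconds As Long)
    Dim endTime As Date
    endTime = DateAdd("s", seconds, Now)
    Do While Now < endTime
        DoEvents
    Loop
End Sub

Public Sub DemoThrottledLoop()
    Dim i As Long
    Randomize                           ' seed Rnd once per run
    For i = 2 To 5
        ' ... request and parse the URL in row i here ...
        PauseSeconds 1 + Int(3 * Rnd)   ' random delay of 1, 2 or 3 seconds
    Next i
End Sub
```

Now only has one-second resolution, so the pause is approximate; for a fixed delay, call PauseSeconds 3 instead of the random expression.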

Related

Scrape with xmlhttp

I would like to get data from https://www.goaloong.net/football/6in1
This page contains a table.
I tried with:
Sub REQUESTXML()
Dim XMLHttpRequest As xmlHttp
Dim HTMLDoc As New HTMLDocument
Dim elem As Object
Dim x As Long
Set XMLHttpRequest = New MSXML2.xmlHttp
XMLHttpRequest.Open "GET", "https://www.goaloong.net/football/6in1", False
XMLHttpRequest.send
While XMLHttpRequest.readyState = 200
DoEvents
Wend
Debug.Print XMLHttpRequest.responseText
HTMLDoc.Body.innerHTML = XMLHttpRequest.responseText
x = 1
For Each elem In HTMLDoc.getElementsByClassName("Leaguestitle")
Sheets("req").Range("A" & x).Value = HTMLDoc.getElementsByTagName("a")(0).innerText
x = x + 1
Next elem
End Sub
I get no result.
Could you kindly help me?
The page https://www.goaloong.net/football/6in1 is dynamic, i.e. the JavaScript files are loaded first, and those scripts then load the content. One approach is to load the full page content in IE and read it from there. Example below (tested):
Sub REQUESTXML()
Dim IE As New InternetExplorer
Dim elem As Object
Dim x As Long
IE.navigate "https://www.goaloong.net/football/6in1"
Do While IE.readyState = READYSTATE_COMPLETE: DoEvents: Loop
Do Until IE.readyState = READYSTATE_COMPLETE: DoEvents: Loop
'for debug purpose
Open ThisWorkbook.Path & "\TESTFILE.html" For Output As #1
Print #1, IE.document.body.innerHTML
Close #1
x = 1
For Each elem In IE.document.getElementsByClassName("Leaguestitle")
Sheets(1).Range("A" & x).Value = elem.innerText
x = x + 1
Next elem
IE.Quit
End Sub
If you're OK with using a DLL and rewriting your code, you can run Microsoft's Edge browser (a Chromium-based browser) with VBA. With that you can do almost anything you want. Note, however, that access to the DOM is performed by JavaScript, not by an object like Dim IE As New InternetExplorer. Look at the VBA sample and you'll get the idea.
https://github.com/peakpeak-github/libEdge
Sidenote: Samples for C# and C++ are also included.

Cannot get the text inside a <p> tag using VBA

I have the following URL
https://www.wavemotion.gr/shop/smartphone-accessories/easy-one-touch-wireless-fast-charging-mount
I am trying to get the availability of the product by using the following
For i = 2 To lastrow
mylink = wks.Cells(i, 2).Value
ie.Navigate mylink
While ie.Busy Or ie.ReadyState < 4: DoEvents: Wend
t = Timer
Do
DoEvents
On Error Resume Next
Set instock = ie.Document.querySelector(".stock.in-stock").innerText
If instock Is Nothing Then
Set availability = ie.Document.querySelector(".stock.out-of-stock").innerText
Else
Set availability = instock
End If
wks.Cells(i, "D") = availability
If Timer - t > MAX_WAIT_SEC Then Exit Do
On Error GoTo 0
Loop
Next i
But I always get Nothing on
Set instock = ie.Document.querySelector(".stock.in-stock").innerText
I checked the query on
https://try.jsoup.org/
and it works there.
What am I doing wrong here? There isn't any id to target, only a class name:
<p class="stock in-stock">Διαθέσιμο</p>
So, what's happening here is that you're trying to Set string datatype innerText to object variable instock. The reason it's returning Nothing is because your On Error Resume Next statement is suppressing the error message. If you took that out and ran it, you would get a Type Mismatch. What you'd need to do is split it into a line that assigns the object to the object variable and then a line that reads the innerText of the assigned object.
Set instock = ie.Document.querySelector(".stock.in-stock")
If instock Is Nothing Then
Set availability = ie.Document.querySelector(".stock.out-of-stock")
Else
Set availability = instock
End If
wks.Cells(i, "D") = availability.innerText
There is a better, faster way. Use xmlhttp and parse that info out of the JSON stored in one of the script tags. If you issue a large number of requests, you may need to add a wait every x requests in case of throttling or blocking. Note: you can use the same approach with InternetExplorer and thus remove many of your lines of code, though you then have another library (.bas) dependency.
You need to install jsonconverter.bas from here, then go to VBE > Tools > References and add a reference to Microsoft Scripting Runtime.
Option Explicit
Public Sub GetStocking()
Dim json As Object, html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.wavemotion.gr/shop/smartphone-accessories/easy-one-touch-wireless-fast-charging-mount", False
.send
html.body.innerHTML = StrConv(.responseBody, vbUnicode)
End With
Set json = JsonConverter.ParseJson(html.querySelector("script[type='application/ld+json']").innerHTML)
Debug.Print json("offers")("availability")
End Sub
Internet Explorer version:
Option Explicit
Public Sub GetInfo()
Dim ie As New InternetExplorer, i As Long, s As String, scripts As Object, json As Object
With ie
.Visible = False
.Navigate2 "https://www.wavemotion.gr/shop/smartphone-accessories/easy-one-touch-wireless-fast-charging-mount"
While .Busy Or .readyState < 4: DoEvents: Wend
Set scripts = .document.querySelectorAll("script[type='application/ld+json']")
For i = 0 To scripts.Length - 1
s = scripts.item(i).innerHTML
If InStr(s, "availability") > 0 Then
Set json = JsonConverter.ParseJson(s)
Exit For
End If
Next
.Quit
If Not json Is Nothing Then Debug.Print json("offers")("availability")
End With
End Sub

Crashing with loop of xmlhttp requests

I am trying to rewrite this macro from ie.application to http requests in VBA.
The macro navigates to a URL, GETs the html then parses and scrapes the required data.
It works well with one request, but when I try to do this with a FOR loop with thousands of URLs Excel crashes.
I think I should "close" the connection in some way before going to the next loop, but I did not find a solution.
Here is my code:
Sub GetQuotes()
Dim xmlhttp As New MSXML2.XMLHTTP60, myurl As String
Dim html As New HTMLDocument
For r = 1 To 10
'*****GO TO PRODUCT PAGE*****
path= ThisWorkbook.Worksheets("Sheet1").Cells(r, 1).Value
myurl = "https://some_domain.com" + path
xmlhttp.Open "GET", myurl, False
xmlhttp.send
html.body.innerHTML = xmlhttp.responseText
'*****GET PRICE*****
If Not html.getElementById("some_id") Is Nothing Then
price = html.getElementById("some_id").innerHTML
ThisWorkbook.Worksheets("sheet1").Cells(r, 2).Value = price
Else
price = "empty"
ThisWorkbook.Worksheets("sheet1").Cells(r, 2).Value = price
End If
Next r
End Sub
XML and HTML references are active
Squeeze in a
Do While xmlhttp.ReadyState <> 4
DoEvents
Loop
after the send. Your current code will fire all requests almost at once. With this, you will wait for the page to load before proceeding.
This worked:
A do-while with DoEvents inside just after the request is sent.
A DoEvents alone just before the next r.
Using only one of them did not work.
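Putting both waits into the loop from the question might look like the sketch below. It assumes the same placeholders as the question ("some_domain.com", "some_id", Sheet1) and the XML and HTML references mentioned there; GetQuotesWithWait is a hypothetical name. Because Open is called with False (synchronous), send normally returns only once the response has arrived, so the first loop usually exits straight away and mainly serves as a safety check.

```vba
Option Explicit

Public Sub GetQuotesWithWait()
    Dim xmlhttp As MSXML2.XMLHTTP60
    Dim html As MSHTML.HTMLDocument
    Dim node As Object
    Dim r As Long, myurl As String

    Set xmlhttp = New MSXML2.XMLHTTP60
    Set html = New MSHTML.HTMLDocument

    With ThisWorkbook.Worksheets("Sheet1")
        For r = 1 To 10
            myurl = "https://some_domain.com" & .Cells(r, 1).Value
            xmlhttp.Open "GET", myurl, False
            xmlhttp.send

            ' First wait: loop until the request reports completion (readyState 4).
            Do While xmlhttp.readyState <> 4
                DoEvents
            Loop
            html.body.innerHTML = xmlhttp.responseText

            ' Look the element up once and reuse the result.
            Set node = html.getElementById("some_id")
            If node Is Nothing Then
                .Cells(r, 2).Value = "empty"
            Else
                .Cells(r, 2).Value = node.innerHTML
            End If

            ' Second wait: let Excel process pending events before the next URL.
            DoEvents
        Next r
    End With
End Sub
```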

web scraping using excel and VBA

I wrote the VBA code below in an Excel sheet, but it does not scrape the data and I don't know why. Can anyone help me? The only result it gives me is "click here to read more". I want to scrape all of the data, such as first name, last name, state, zip code and so on.
Sub extractTablesData()
Dim IE As Object, obj As Object
Dim myState As String
Dim r As Integer, c As Integer, t As Integer
Dim elemCollection As Object
Set IE = CreateObject("InternetExplorer.Application")
myState = InputBox("Enter the city where you wish to work")
With IE
.Visible = True
.navigate ("http://www.funeralhomes.com/go/listing/Search? name=&city=&state=&country=USA&zip=&radius=")
While IE.readyState <> 4
DoEvents
Wend
For Each obj In IE.document.all.item("state").Options
If obj.innerText = myState Then
obj.Selected = True
End If
Next obj
IE.document.getElementsByValue("Search").item.Click
Do While IE.Busy: DoEvents: Loop
ThisWorkbook.Sheets("Sheet1").Range("A1:K1500").ClearContents
Set elemCollection = IE.document.getElementsByTagName("TABLE")
For t = 0 To (elemCollection.Length - 1)
For r = 0 To (elemCollection(t).Rows.Length - 1)
For c = 0 To (elemCollection(t).Rows(r).Cells.Length - 1)
ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = elemCollection(t).Rows(r).Cells(c).innerText
Next c
Next r
Next t
End With
Set IE = Nothing
End Sub
Using the same URL as the answer already given you could alternatively select with CSS selectors to get the elements of interest, and use split to get just the names and address parts from the text. We can also do away with the browser altogether to get faster results from first results page.
Business name:
You can get the name with the following selector (using paid listing example):
div.paid-listing .listing-title
Address info:
The associated descriptive information can be retrieved with the selector:
div.paid-listing .address-summary
And then using split we can parse this into just the address information.
Code:
Option Explicit
Public Sub GetTitleAndAddress()
Dim oHtml As HTMLDocument, nodeList1 As Object, nodeList2 As Object, i As Long
Const URL As String = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"
Set oHtml = New HTMLDocument
With CreateObject("WINHTTP.WinHTTPRequest.5.1")
.Open "GET", URL, False
.send
oHtml.body.innerHTML = .responseText
End With
Set nodeList1 = oHtml.querySelectorAll("div.paid-listing .listing-title")
Set nodeList2 = oHtml.querySelectorAll("div.paid-listing .address-summary")
With Worksheets("Sheet3")
.UsedRange.ClearContents
For i = 0 To nodeList1.Length - 1
.Range("A" & i + 1) = nodeList1.Item(i).innerText
.Range("B" & i + 1) = Split(nodeList2.Item(i).innerText, Chr$(10))(0)
Next i
End With
End Sub
Yeah, without an API, this can be very tricky at best, and very inconsistent at worst. For now, you can try the script below.
Sub DumpData()
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
URL = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"
'Wait for site to fully load
IE.Navigate2 URL
Do While IE.Busy = True
DoEvents
Loop
RowCount = 1
With Sheets("Sheet1")
.Cells.ClearContents
RowCount = 1
For Each itm In IE.document.all
If itm.classname Like "*free-listing*" Or itm.classname Like "*paid-listing*" Then
.Range("A" & RowCount) = itm.classname
.Range("B" & RowCount) = Left(itm.innertext, 1024)
RowCount = RowCount + 1
End If
Next itm
End With
End Sub
You probably want some kind of input box to capture the city and state and radius from the user, or capture those variable in cells in your worksheet.
Note that '%20' is a URL-encoded space character.
I got this idea from a friend of mine, Joel, a long time ago. That guy is great!
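Capturing the location from the user and encoding the spaces could be sketched roughly as follows. BuildListingUrl and DemoBuildListingUrl are hypothetical names, and the ".../USA/<state>/<city>" path layout is an assumption read off the single URL above. Only spaces are encoded here; other special characters would need additional handling.

```vba
Option Explicit

' Builds a listing URL for a state and city, replacing spaces with %20.
' The state-then-city path layout is an assumption based on the sample URL.
Public Function BuildListingUrl(ByVal stateName As String, ByVal cityName As String) As String
    Const BASE_URL As String = "http://www.funeralhomes.com/go/listing/ShowListing/USA/"
    BuildListingUrl = BASE_URL & Replace(Trim$(stateName), " ", "%20") & "/" & _
                      Replace(Trim$(cityName), " ", "%20")
End Function

Public Sub DemoBuildListingUrl()
    Dim stateName As String, cityName As String
    stateName = InputBox("Enter the state, e.g. New York")
    cityName = InputBox("Enter the city, e.g. New York")
    ' InputBox returns an empty string when the user cancels.
    If Len(stateName) > 0 And Len(cityName) > 0 Then
        Debug.Print BuildListingUrl(stateName, cityName)
    End If
End Sub
```

The resulting string can then be passed to IE.Navigate2 in place of the hard-coded URL.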

Web-scraping on intranet

I wrote a VBA code to scrape data from my company's intranet.
Issues:
The below error occurs:
Run-time error '91':
object variable or with block variable not set
It happens on:
myPoints = Trim(Doc.getElementsByName("price")(0).getAttribute("value"))
When I debug it and run line by line, it can retrieve all the values.
Input and Output:
I input multiple product ID on column B and retrieve data on column C:
Column B = product ID
Column C = price
HTML:
<td id="myPower_val_9" style="visibility: visible;">
<input type="text" disabled="disabled" value="300" name="price"></input>
</td>
VBA:
Sub Button1_Click()
Dim ie As Object
Dim r As Integer
Dim myPoints As String
Dim Doc As HTMLDocument
Set ie = New InternetExplorerMedium
For r = 2 To Range("B65535").End(xlUp).Row
With ie
.Visible = 0
.navigate "www.example.com/product/" & Cells(r, "B").Value
Do Until .readyState = 4
DoEvents
Loop
End With
Set Doc = ie.document
myPoints = Trim(Doc.getElementsByName("price")(0).getAttribute("value"))
Cells(r, "C").Value = myPoints
Next r
End Sub
Have I missed an error handler?
You need to wait for the document to be fully rendered and the DOM available before accessing any elements. ie.ReadyState changes to READYSTATE_COMPLETE once the page connects and starts loading. The reason that your code works when debugging is that in the couple of seconds it takes for you to start working with the debugger, the page finishes loading.
With ie
.Visible = True
.Navigate "www.example.com/product/" & Cells(r, "B").Value
Do Until .ReadyState = READYSTATE_COMPLETE
DoEvents
Loop
Do Until .Document.ReadyState = "complete"
DoEvents
Loop
End With
I would also recommend that you make the IE window visible, at least while you're developing. Once your functionality is complete and debugged, you can make the window invisible. Keep in mind that if you forget to close your invisible IE windows when your code finishes, your users will end up with runaway iexplore.exe processes.
If you only want to ignore the error and continue with the next iteration, use this modified code:
Sub Button1_Click()
Dim ie As Object
Dim r As Integer
Dim myPoints As String
Dim Doc As HTMLDocument
Set ie = New InternetExplorerMedium
For r = 2 To Range("B65535").End(xlUp).Row
With ie
.Visible = 0
.navigate "www.example.com/product/" & Cells(r, "B").Value
Do Until .readyState = 4
DoEvents
Loop
End With
Set Doc = ie.document
'Edit:
myPoints = ""
On Error Resume Next
myPoints = Trim(Doc.getElementsByName("price")(0).getAttribute("value"))
On Error Goto 0
Cells(r, "C").Value = myPoints
Next r
End Sub
You could also loop until element is set (add a timeout clause as well)
Dim a As Object
Do
DoEvents
On Error Resume Next
Set a = Doc.getElementsByName("price")
On Error GoTo 0
Loop While a Is Nothing
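A timeout clause for that loop might be sketched like this. WaitForNamedElement is a hypothetical helper, and the 10-second limit in the usage comment is an arbitrary choice. The sketch also checks the collection's Length, on the assumption that getElementsByName can return an empty collection rather than Nothing when the element is not there yet.

```vba
Option Explicit

' Returns the first element with the given name, or Nothing if none
' appears within timeoutSeconds. Hypothetical helper for illustration.
Public Function WaitForNamedElement(ByVal doc As Object, ByVal elementName As String, _
                                    ByVal timeoutSeconds As Long) As Object
    Dim deadline As Date
    Dim nodes As Object

    deadline = DateAdd("s", timeoutSeconds, Now)
    Do
        DoEvents
        Set nodes = Nothing
        On Error Resume Next            ' the DOM may not be ready yet
        Set nodes = doc.getElementsByName(elementName)
        On Error GoTo 0
        If Not nodes Is Nothing Then
            If nodes.Length > 0 Then
                Set WaitForNamedElement = nodes.Item(0)
                Exit Function
            End If
        End If
    Loop While Now < deadline
    ' Falling out of the loop leaves the return value as Nothing.
End Function

' Usage inside the For r loop of the question:
'   Dim priceInput As Object
'   Set priceInput = WaitForNamedElement(ie.document, "price", 10)
'   If Not priceInput Is Nothing Then
'       Cells(r, "C").Value = Trim(priceInput.getAttribute("value"))
'   End If
```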

Resources