Capture all the data - excel

I have a question relating to HTML parsing. I would like to catch text from this site into my current spreadsheet, but code can only loop through each 1 page.
Sub Data()
Dim Http As New XMLHTTP60, Html As New HTMLDocument, topic As HTMLHtmlElement
With Http
.Open "GET", "https://voronezh.leroymerlin.ru/catalogue/dekorativnye-oboi/?sortby=1&display=90", False
.send
Html.body.innerHTML = .responseText
End With
For Each topic In Html.getElementsByClassName("ui-product-card__info")
With topic.getElementsByClassName("product-name")
If .Length Then x = x + 1: Cells(x, 1) = .item(0).innerText
End With
With topic.getElementsByClassName("main-value-part")
If .Length Then Cells(x, 2) = .item(0).innerText
End With
Next topic
End Sub
How can I loop the next page in the process to capture all the data?

Do you mean you want to get text from next pages on the website?
you can just continue in the way you do, but just loop through the page number:
Dim i as Integer
For i = 1 to 96
'Do here the same what you were doing, but replace your website string into:
"https://voronezh.leroymerlin.ru/catalogue/dekorativnye-oboi/?
display=90&sortby=1&page=" & i
Next i

Related

Webscraping of product prices and specs

I'm new to the VBA world and i was trying to create a Excel file to get the cheapest price and name tag on a website. I've created one file that goes through the whole search list and gathers each product URL and places on a spreadsheet. My challenge now is in making this second code work. It starts out getting the information without any issues but then, after 10 or more URLs the macro gives a bug and it starts repeating the information on all the following cells.. Is there a way i can make the code run slower so it doesn't get into this problem?
I'll list here my code and a sample of the URLs im scraping..
here's the code i've been using:
Sub test()
Dim URL As String
Set ie = CreateObject("internetexplorer.application")
For i = 2 To 300
URL = Cells(i, 1).Value
ie.navigate URL
ie.Visible = False
Do While ie.busy And ie.readystate <> "readystate_complete"
DoEvents
Loop
Cells(i, 3) = ie.document.getElementsByTagName("h1")(0).innerText
Cells(i, 4) = ie.document.getElementsByTagName("strong")(0).innerText
Next i
ie.Quit
MsgBox "acabou"
End Sub
<https://www.zoom.com.br/notebook/notebook-vaio-vjf157f11x-b0211s-intel-core-i5-8250u-15-6-8gb-ssd-256-gb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-samsung-chromebook-plus-intel-celeron-3965y-12-2-4gb-emmc-32-gb-chrome-os-touchscreen>
<https://www.zoom.com.br/notebook/notebook-dell-xps-7390-intel-core-i7-10710u-13-3-16gb-ssd-512-gb-windows-10-touchscreen>
<https://www.zoom.com.br/notebook/notebook-dell-i15-3583-a5-intel-core-i7-8565u-15-6-8gb-hd-2-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i5-8250u-15-6-4gb-hd-1-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-dell-i15-7580-a20-intel-core-i7-8550u-15-6-8gb-hd-1-tb-geforce-mx150-windows-10>
<https://www.zoom.com.br/notebook/notebook-dell-i14-3480-u30-intel-core-i5-8265u-14-4gb-hd-1-tb-linux-8-geracao>
<https://www.zoom.com.br/notebook/macbook-pro-apple-muhn2bz-intel-core-i5-13-3-8gb-ssd-128-gb-tela-de-retina>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc150-amd-a4-9120-14-2gb-emmc-32-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-samsung-np930qaa-kw1br-intel-core-i7-8550u-13-3-8gb-ssd-256-gb-windows-10-touchscreen>
<https://www.zoom.com.br/notebook/notebook-acer-a515-51g-58vh-intel-core-i5-7200u-15-6-8gb-hd-1-tb-geforce-940mx>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc222-intel-celeron-dual-core-13-3-4gb-emmc-64-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-pt515-51-788a-intel-core-i7-9750h-15-6-32gb-ssd-1-tb-geforce-rtx-2070-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-a315-53-53ak-intel-core-i5-7200u-15-6-4gb-hd-1-tb-windows-10-7-geracao>
<https://www.zoom.com.br/notebook/notebook-dell-i15-5584-m40-intel-core-i7-8565u-15-6-8gb-hd-2-tb-geforce-mx130-windows-10>
<https://www.zoom.com.br/notebook/notebook-acer-a315-41g-r21b-amd-ryzen-5-2500u-15-6-8gb-hd-1-tb-radeon-535-windows-10>
<https://www.zoom.com.br/notebook/notebook-positivo-master-n2140-intel-core-i3-7020u-14-4gb-hd-500-gb-windows-10-7-geracao>
<https://www.zoom.com.br/notebook/notebook-multilaser-pc101-intel-atom-14-1gb-ssd-32-gb-windows-10>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i5-8250u-15-6-8gb-hd-1-tb-windows-10-8-geracao>
<https://www.zoom.com.br/notebook/notebook-acer-an515-51-77fh-intel-core-i7-7700hq-15-6-8gb-hd-1-tb-geforce-gtx-1050-windows-10>
<https://www.zoom.com.br/notebook/notebook-dell-i15-3583-a2yp-intel-core-i5-8265u-15-6-4gb-optane-16-gb-hd-1-tb-windows-10>
<https://www.zoom.com.br/notebook/notebook-asus-g531gt-intel-core-i7-9750h-15-6-16gb-ssd-512-gb-geforce-gtx-1650-windows-10>
<https://www.zoom.com.br/notebook/notebook-vaio-fit-15s-intel-core-i3-7100u-15-6-4gb-hd-1-tb-windows-10-home>
<https://www.zoom.com.br/notebook/notebook-samsung-s50-intel-core-i7-7500u-13-3-8gb-ssd-256-gb-windows-10-style>
<https://www.zoom.com.br/notebook/notebook-lenovo-b330-intel-core-i3-7020u-15-6-4gb-ssd-120-gb-windows-10-7-geracao>
First of all:
Always declare all variables. To force this, always write Option Explicit as the first line in each module. This way, especially typos in variable names are immediately detected by the compiler.
Here is what to do about your problem:
IE is sometimes a real diva. For example, it doesn't like to have to process URLs in the same instance in quick succession. Therefore it is advisable to kick it out of memory and restart it for each new URL.
To restart it quickly, the deletion of coockies, the cache, etc. must not be set in its settings. Otherwise automation errors will occur.
Try this macro. With the given URLs it works:
Option Explicit
Sub test()
Dim URL As String
Dim ie As Object
Dim i As Long
For i = 2 To 300
If i > 14 Then
ActiveWindow.SmallScroll down:=1
End If
URL = ActiveSheet.Cells(i, 1).Value
Set ie = CreateObject("internetexplorer.application")
ie.navigate URL
ie.Visible = False
Do While ie.readystate <> 4: DoEvents: Loop
ActiveSheet.Cells(i, 3) = ie.document.getElementsByTagName("h1")(0).innerText
ActiveSheet.Cells(i, 4) = ie.document.getElementsByTagName("strong")(0).innerText
ie.Quit
Set ie = Nothing
Next i
MsgBox "acabou"
End Sub
Try this
Sub GetPrices()
Dim html As MSHTML.HTMLDocument, r As Long
For r = 1 To 4
Set html = GetHTML(Cells(r, 1).Value)
Cells(r, 3).Value = html.querySelector("h1.product-name").innerText
Cells(r, 4).Value = Replace(Replace(html.querySelector(".product-price").innerText, "a partir de ", vbNullString), ":( ", "")
Set html = Nothing
Next r
End Sub
Function GetHTML(ByVal sURL As String) As HTMLDocument
Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
Set http = New MSXML2.XMLHTTP60
Set html = New MSHTML.HTMLDocument
With http
.Open "Get", sURL, False
.send
html.body.innerHTML = .responseText
End With
Set GetHTML = html
End Function
In python I will do :
You should first import library
from time import sleep
Two functions available :
this function let you sleep before execute every 3 seconds
time.sleep(3)
this function let you random sleep between every execute from 1 to 3 seconds
sleep(randint(1, 3)
Note: take aware about :
cookies because sometimes you need to post request to scrape some id before get.
the syntax of your header and set correctly origin & referrer parameters.

Print title importing from one location to another

I've created a vba script to parse the title of diffetent posts along with the editing status of those posts from a website. What I wish to do now is let my script parse the title from it's landing page but print the title at the same time when it will print the editing status. I do not wish to create two subs for this task. I do not even know if it is possible in vba. However, if anything unclear please check out the comment within my script.
Sub ImportTitleFromAnotherLocation()
Const LINK$ = "https://stackoverflow.com/questions/tagged/web-scraping"
Const prefix$ = "https://stackoverflow.com"
Dim Http As New XMLHTTP60, Html As New HTMLDocument
Dim editInfo As Object, I&, targetUrl$, postTile$
With Http
.Open "GET", LINK, False
.send
Html.body.innerHTML = .responseText
End With
With Html.querySelectorAll(".summary .question-hyperlink")
For I = 0 To .Length - 1
postTitle = .item(I).innerText 'I like this line to be transferred to the location below
targetUrl = Replace(.item(I).getAttribute("href"), "about:", prefix)
With Http
.Open "GET", targetUrl, False
.send
Html.body.innerHTML = .responseText
End With
R = R + 1: Cells(R, 1) = postTitle 'here I wish to use the above line like this
Set editInfo = Html.querySelector(".user-action-time > a")
If Not editInfo Is Nothing Then
Cells(R, 2) = editInfo.innerText
End If
Next I
End With
End Sub
You are overwriting your html document in the loop. A simple way would be to use a second htmldocument variable. A more verbose way would be to store the titles before the loop, for example in an array during an additional loop, then use your i variable to index into that to retrieve each title during the existing loop.
Sub ImportTitleFromAnotherLocation()
Const LINK$ = "https://stackoverflow.com/questions/tagged/web-scraping"
Const prefix$ = "https://stackoverflow.com"
Dim Http As New XMLHTTP60, Html As New HTMLDocument, Html2 As New HTMLDocument
Dim editInfo As Object, I&, targetUrl$, postTile$
Dim postTitle As String, r As Long
With Http
.Open "GET", LINK, False
.send
Html.body.innerHTML = .responseText
End With
With Html.querySelectorAll(".summary .question-hyperlink")
For I = 0 To .Length - 1
postTitle = .item(I).innerText 'I like this line to be transferred to the location below
targetUrl = Replace$(.item(I).getAttribute("href"), "about:", prefix)
With Http
.Open "GET", targetUrl, False
.send
Html2.body.innerHTML = .responseText
End With
r = r + 1: ActiveSheet.Cells(r, 1) = postTitle 'here I wish to use the above line like this
Set editInfo = Html2.querySelector(".user-action-time > a")
If Not editInfo Is Nothing Then
ActiveSheet.Cells(r, 2) = editInfo.innerText
End If
Next I
End With
End Sub

Error 91 Using For Loop with XMLHttpRequest VBA

I am trying to scrape a site for the contents of various HTML tags. I am using an array of html tags against an array of URLs.
Sub pArse(UserForm4_HiddenList_Text)
Dim http As New XMLHTTP60
'Dim html As New HTMLDocument
Dim hero_true_Val As Integer
Dim down_var As Integer
Dim bill_array_redim
Dim element_tag As String
Dim address_count As Integer
hero_true_Val = Application.WorksheetFunction.CountIf(Range("B:B"), True)
down_var = 1
bill_array_redim = Split(UserForm4_HiddenList_Text, Chr(10))
ReDim address_array(hero_true_Val)
For Z = 2 To 6
If Sheets("resource").Cells(Z, 2).Value <> "False" Then
address_count = address_count + 1
address_array(address_count) = Sheets("resource").Cells(Z, 1).Value
End If
Next Z
Sheets("Sheet1").Select
For url_stack = 0 To UBound(bill_array_redim)
Sheets("sheet1").Cells(down_var, 1) = bill_array_redim(url_stack)
For what_to_check = 1 To address_count
With http
.Open "GET", bill_array_redim(url_stack), False
.send
Do While http.readyState <> 4
DoEvents
Loop
End With
Dim html As New HTMLDocument
With html
.body.innerHTML = http.responseText
' the next line is where the error occurs
Sheets("Sheet1").Cells(down_var, 2).Value = .querySelector(address_array(what_to_check)).innerText
End With
down_var = down_var + 1
Next what_to_check
Next url_stack
UserForm4.Hide
End Sub
A user clicks a command button which then feeds in an array of URLs (UserForm4_HiddenList_Text). The code then checks for the existence of the word "true" in the sheet named "resource." This generates an array of terms to check for.
the problem I'm having is that on the second iteration of the what_to_check loop, I'm getting
error 91 : Object or with block variable not set
I'm unsure of what the issue is here and it only seems to occur if I pass in two html tags rather than one.
edit: thanks for replying Cindy. The error location is bolded. The message is "Object or with block variable not set."
Turns out I was being given an error I either didn't fully understand or wasn't very clear. The issue was that the response didn't have the tag inside it that I was looking for as it wasn't present on the site. By using:
With html
.body.innerHTML = http.responseText
If .body.contains(.querySelector(tag_array(tag_array_index_no)))
Then
Sheets(1).Range(Cells(down_var, 2).Address).Value =
tag_array(tag_array_index_no)
Sheets(1).Range(Cells(down_var, 3).Address).Value =
.querySelector(tag_array(tag_array_index_no)).innerText
Else
Sheets(1).Range(Cells(down_var, 2).Address).Value =
tag_array(tag_array_index_no)
Sheets(1).Range(Cells(down_var, 3).Address).Value =
tag_array(tag_array_index_no) + " not set."
End If
End With
I'm able to check if the tag exists in the response text.

Object does not support this property of method, while parsing html document

I want to scrap every post heading form this blog. I am using the code bellow but it's giving me an error "Run time error 438 object does not support this property or method" in line
Cells(i, 1).Value = ele.getElementsByClassName("entry-title")(0).getElementsByTagName("a")(0).innerText
The code is:
Private Sub CommandButton1_Click()
Dim bot As Object
Dim doc As New HTMLDocument
Dim ele As HTMLElementCollection
Dim i As Long
Set bot = CreateObject("MSXML2.XMLHTTP")
bot.Open "GET", "http://themakeupblogger.com/makeup/", False
bot.send
doc.body.innerHTML = bot.responseText
For Each ele In doc.getElementsByTagName("article")
i = Cells(Rows.Count, 1).End(xlUp).Row + 1
Cells(i, 1).Value = ele.getElementsByClassName("entry-title")(0).getElementsByTagName("a")(0).innerText
Next ele
End Sub
Give this a shot and get all the titles you are after.
Sub demo()
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim r As Long, ele As Object
With http
.Open "GET", "http://themakeupblogger.com/makeup/", False
.send
html.body.innerHTML = .responseText
End With
For Each elem In html.getElementsByClassName("entry-title")
With elem.getElementsByTagName("a")
If .Length Then r = r + 1: Cells(r, 1) = .Item(0).innerText
End With
Next elem
End Sub
Reference to add to the library:
1. Microsoft XML, v6.0
2. Microsoft HTML Object Library
Partial results:
4 High-Coverage Foundations That Might As Well Be Skincare
10 Memorial Day Beauty Essentials That Belong In Your Beach Bag
Don’t Get Married Without These Wedding Day Makeup Tips (Courtesy of a Makeup Artist)
To get the articles from that page you can do something like:
Sub demo()
Dim http As New InternetExplorer, html As New HTMLDocument
Dim r As Long, elem As Object
With http
.Visible = False
.navigate "http://themakeupblogger.com/makeup/"
Do Until .readyState = READYSTATE_COMPLETE: Loop
Set html = .document
End With
For Each elem In html.getElementsByTagName("article")
With elem.getElementsByTagName("h1")
If .Length Then r = r + 1: Cells(r, 1) = .Item(0).getElementsByTagName("a")(0).innerText
End With
With elem.getElementsByTagName("div")(3).getElementsByTagName("p")
If .Length Then Cells(r, 2) = .Item(0).innerText
End With
Next elem
End Sub
This time the reference you should add to the library:
1. Microsoft Internet Controls
2. Microsoft HTML Object Library

Excel Macro to draw thread comments from website into cells

I am trying to store Reddit thread comments in an excel spreadsheet, however I have had trouble trying to figure out how to do this. I do not have much experience with using macros to get data from webpages, so I have been finding it hard to figure out how exactly to draw out each comment from a specified Reddit thread and place it in a cell, and whether or not it is possible to do.
This is what I have so far:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
With htm.getelementbyid("comments")
Set cellrangex = .Rows(x).Cells.Length - 1
Set cellrangey = .Rows(x).Cells.Length - 1
Set cellrange1 = Sheets(1).Cells(x + 1, y + 1).Value
Set cellrange2 = .Rows(x).Cells(y).innertext
For x = 0 To cellrangex
For y = 0 To cellrangey
cellrange = cellrange2
Next y
Next x
End With
End Sub
You'll really need to analyze the contents of the web page you are scraping with a decent HTML editor. I would suggest navigating to the page in question in chrome and using F12 to open it's developer tool. In the "Elements" tab you can quickly see which HTML is producing which part of the page (open both the page and the developer tools next to each other).
You'll notice as you head into the comments that the text of each comment is inside a <p> tag and each <p> tag is inside a <div>. We are looking for patterns, so this is a good start.
You'll also notice that each one of those <div> tags has a class of md.
So... Lets load all of the pages <div> tags into an object and then look for the ones that have a className that contains "md":
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
For Each DivElement In Divelements
If InStr(1, DivElement.ClassName, "md") Then
'print contents to the Immediate window for debugging View>>Immediate Window to insure it's up in your VBE
Debug.Print DivElement.InnerText
End If
Next
End Sub
With that you'll see all of the comments stuck in the Immediate window (go to View>>Immediate Window) so you can see this debug output.
After skipping around the nodes it looks like you can navigate up a couple of elements and back down the tree to get the username:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
On Error Resume Next
For Each divElement In Divelements
If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
Set commentEntry = divElement.ParentNode.ParentNode.ParentNode
'Print the name and the comment
Debug.Print commentEntry.FirstChild.FirstChild.NextSibling.InnerText & ":", divElement.InnerText
End If
Next
End Sub
To print this out to the sheet just point to a cell instead of the debug.print immediate window. Something like:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Dim ws As Worksheet, wsCell As Integer
'set the worksheet to print to and the first row to start printing.
Set ws = Sheets("Sheet1")
wsCell = 1
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
On Error Resume Next
For Each divElement In Divelements
If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
Set commentEntry = divElement.ParentNode.ParentNode.ParentNode
'Print the name and the comment to ws sheet columns 1 and 2
ws.Cells(wsCell, 1).Value = commentEntry.FirstChild.FirstChild.NextSibling.InnerText
ws.Cells(wsCell, 2).Value = divElement.InnerText
'iterate to the next row
wsCell = wsCell + 1
End If
Next
End Sub

Resources