Webscrape inside a For loop - Follow up - Excel

Following my previous question Webscrape VBA with condition, I started trying to automate the procedure for a list of URLs from this website that I prepared in my Excel document. When I tried it with 20 or 30 URLs it worked perfectly, yet when I increased the number, a "Subscript out of range" error occurred on the ReDim in GetNodesTextAsArray. Do you have any idea why?
After some research I tried replacing it with a For loop, but that doesn't change anything.
Public Sub WindInfo()
    'VBE > Tools > References:
    '1. Microsoft XML, v6.0
    '2. Microsoft HTML Object Library
    '3. Microsoft Scripting Runtime
    Dim xhr As MSXML2.XMLHTTP60: Set xhr = New MSXML2.XMLHTTP60
    Dim html As MSHTML.HTMLDocument: Set html = New MSHTML.HTMLDocument
    Dim ws As Worksheet: Set ws = ThisWorkbook.Worksheets("Sheet1")
    Dim url As String
    Dim j As Integer
    Dim r As Long
    r = 1
    For j = 1 To 20
        url = Worksheets("List").Cells(j, 1).Value
        With xhr
            .Open "GET", url, False
            .send
            html.body.innerHTML = .responseText
        End With
        Dim generalities As Object, arrGen(), partsList As Object
        Set generalities = html.querySelectorAll("#bloc_texte table ~ table li")
        arrGen = GetNodesTextAsArray(generalities)
        Dim parts As Object, numberOfParts As Long
        Set partsList = html.querySelectorAll("h1 ~ h3, ul ~ h3")
        If partsList.Length > 0 Then
            numberOfParts = partsList.Length / 2
            Set parts = html.querySelectorAll("h3 + ul")
            Dim i As Long, liNodes As Object, arr()
            Dim html2 As MSHTML.HTMLDocument: Set html2 = New MSHTML.HTMLDocument
            For i = 0 To numberOfParts - 1
                ws.Cells(r, 1).Resize(1, UBound(arrGen)) = arrGen
                html2.body.innerHTML = parts.Item(i).outerHTML & parts.Item(i + numberOfParts).outerHTML
                Set liNodes = html2.querySelectorAll("li")
                arr = GetNodesTextAsArray(liNodes)
                ws.Cells(r, 5).Resize(1, UBound(arr)) = arr
                r = r + 1
            Next
        Else
            arr = GetNodesTextAsArray(html.querySelectorAll("#bloc_texte h1 + ul").Item(1).getElementsByTagName("li"))
            ws.Cells(r, 1).Resize(1, UBound(arrGen)) = arrGen
            ws.Cells(r, 5).Resize(1, UBound(arr)) = arr
            r = r + 1
        End If
        Application.Wait (Now + TimeValue("0:00:01"))
    Next
End Sub

Public Function GetNodesTextAsArray(ByVal nodeList As Object) As Variant()
    Dim i As Long, results()
    ReDim results(1 To nodeList.Length) 'raises "Subscript out of range" when Length = 0
    For i = 0 To nodeList.Length - 1
        results(i + 1) = nodeList.Item(i).innerText
    Next i
    GetNodesTextAsArray = results
End Function

9 times out of 10, the problem with VBA web scraping is page-loading related.
So what you want to do is try an error handler that loops while the page loads, and then tests to see whether the element appears on the page.
I'll leave the code to you, but the timing here is essentially everything, since even checking the status code and waiting for the document to load is no guarantee that everything you need is loaded.
Even with great tools like Selenium and Puppeteer you still need to check and handle element loading and timing; you can't just assume that because the document loaded, the element is there.
Makes sense?
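For illustration, here is a minimal sketch of that retry idea, reusing the MSXML2.XMLHTTP60 and MSHTML objects from the question; FetchHtml, testSelector, and MaxAttempts are assumed names, not part of the original answer. Note too that ReDim results(1 To nodeList.Length) in GetNodesTextAsArray raises exactly this "Subscript out of range" error whenever nodeList.Length is 0, so it is worth guarding against an empty node list (or skipping that URL) regardless of timing.

'A minimal sketch: re-request the url until an element matching testSelector
'is present, or give up after MaxAttempts tries (all names assumed).
Public Function FetchHtml(ByVal url As String, ByVal testSelector As String) As MSHTML.HTMLDocument
    Const MaxAttempts As Long = 3
    Dim xhr As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument, attempts As Long
    Set xhr = New MSXML2.XMLHTTP60
    Set html = New MSHTML.HTMLDocument
    Do
        attempts = attempts + 1
        xhr.Open "GET", url, False
        xhr.send
        html.body.innerHTML = xhr.responseText
        If Not html.querySelector(testSelector) Is Nothing Then Exit Do 'element found
        Application.Wait Now + TimeSerial(0, 0, 1) 'pause briefly before retrying
    Loop While attempts < MaxAttempts
    Set FetchHtml = html 'caller should still verify the element is present
End Function

In WindInfo you would then replace the With xhr block with something like Set html = FetchHtml(url, "#bloc_texte table ~ table li"), and skip the row (or log the URL) when the selector still matches nothing.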

Related

Web Scraping - Problems with tags

I am new to scraping web data and also to using For...Next. I am trying to get data (all pages) from a website, but it seems the code is wrong, since I get error 91. This is the code:
Dim ie As Object

Sub connect()
    Set ie = CreateObject("INTERNETEXPLORER.APPLICATION")
    ie.NAVIGATE "https://www.worldathletics.org/world-rankings/100m/men"
    ie.Visible = True
End Sub

Sub id_tr_td_for()
    Range("a1:z10000").ClearContents
    For i = 0 To 10
        For j = 0 To 5
            Cells(i + 1, j + 1) = ie.document.getElementById("toplists").getElementsByTagName("tr")(i).getElementsByTagName("td")(j).innerText
        Next
    Next
End Sub
Can somebody help me with it, and also let me know how I can list all pages?
Thank you.
I'm not sure where the error comes from; I got it too.
The following code should copy all the data for the selected pages to Sheet1 (it also prints its progress to the debug window).
You will need to add a couple of references in the VBA Editor to be able to use it (Tools menu > References, then find and select them): Microsoft HTML Object Library and Microsoft Internet Controls.
Const MaxPage = 2 ' set to 26 (or however many there are) - at 2 for testing purposes
Dim Browser As InternetExplorer

Sub Start()
    Dim Page As Integer: Page = 1 ' start at page 1
    Dim PageDocument As IHTMLDocument
    Dim RecordRow As IHTMLElementCollection
    Dim RecordItem As IHTMLElement
    Dim Sheet As Worksheet: Set Sheet = ThisWorkbook.Worksheets("Sheet1") ' output sheet
    If Browser Is Nothing Then
        Set Browser = New InternetExplorer
    End If
    Dim oRow As Integer: oRow = 2 ' begin output at row 2 (account for header)
    Dim Record As Integer
    For Page = 1 To MaxPage
        LoadPage Page
        For Record = 0 To 99 ' zero index, 100 items (0-99)
            Set PageDocument = Browser.Document
            Set RecordRow = PageDocument.getElementById("toplists").getElementsByTagName("table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")(Record).getElementsByTagName("td")
            Sheet.Cells(oRow, 1).Value = Trim(RecordRow(0).innerText)
            Sheet.Cells(oRow, 2).Value = Trim(RecordRow(1).innerText)
            Sheet.Cells(oRow, 3).Value = Trim(RecordRow(2).innerText)
            Sheet.Cells(oRow, 4).Value = Trim(RecordRow(3).innerText)
            Sheet.Cells(oRow, 5).Value = Trim(RecordRow(4).innerText)
            Sheet.Cells(oRow, 6).Value = Trim(RecordRow(5).innerText)
            oRow = oRow + 1
        Next Record
    Next Page
    Browser.Quit
End Sub

Sub LoadPage(ByVal PageNumber As Integer)
    Debug.Print "Navigating to Page #" & CStr(PageNumber)
    Browser.navigate "https://www.worldathletics.org/world-rankings/100m/men?page=" & CStr(PageNumber)
    While Browser.readyState <> 4 Or Browser.Busy: DoEvents: Wend
    Debug.Print "Navigation Complete"
End Sub
Updated Code
The index out-of-bounds error likely occurred due to the hard-coded indexes: if a page does not have 100 records it will fail, and if a record doesn't have 6 fields it will fail. The following code does away with indexes and just scrapes every row and cell it finds. You shouldn't get index errors, but the output could be jagged.
Further Update
The 462 error was caused by Browser.Quit. This closes the browser but does not set the reference to Nothing, so when you run the code again it is trying to use a non-existent browser. Explicitly setting it to Nothing at the end fixes this.
There is no link in the competitor column; the whole row has a data-url attribute which is handled by something else. That URL can easily be accessed, though.
Sub NewStart()
    Dim PageDocument As IHTMLDocument
    Dim Records As IHTMLElementCollection
    Dim Record As IHTMLElement
    Dim RecordItems As IHTMLElementCollection
    Dim RecordItem As IHTMLElement
    Dim OutputRow As Integer: OutputRow = 2
    Dim OutputColumn As Integer
    Dim Page As Integer
    Dim Sheet As Worksheet: Set Sheet = ThisWorkbook.Worksheets("Sheet1")
    If Browser Is Nothing Then
        Set Browser = New InternetExplorer
        Browser.Visible = True
    End If
    For Page = 1 To MaxPage
        LoadPage Page
        Set PageDocument = Browser.Document
        Set Records = PageDocument.getElementById("toplists").getElementsByTagName("table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")
        For Each Record In Records
            Set RecordItems = Record.getElementsByTagName("td")
            OutputColumn = 1
            For Each RecordItem In RecordItems
                Sheet.Cells(OutputRow, OutputColumn).Value = Trim(RecordItem.innerText)
                OutputColumn = OutputColumn + 1
            Next RecordItem
            Sheet.Cells(OutputRow, OutputColumn).Value = "http://worldathletics.org/" & Record.getAttribute("data-athlete-url") ' This will add the link after the last column
            OutputRow = OutputRow + 1
        Next Record
    Next Page
    Browser.Quit
    Set Browser = Nothing ' This will fix the 462 error
End Sub

Get picture links from Google Search

I am trying to get the picture links from a Google search link, and this is my try:
Sub Test()
    Const sURL As String = "https://www.google.com.eg/search?q=baby&sxsrf=ALeKk01tyfvvxyYjaC0YctjxaY0RlvPnuw:1586804351129&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjB77TtiuboAhUl5uAKHR5KA2wQ_AUoAXoECBQQAw&biw=1280&bih=881"
    Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument
    Set http = New MSXML2.XMLHTTP60
    Set html = New MSHTML.HTMLDocument
    With http
        .Open "Get", sURL, False
        .send
        html.body.innerHTML = .responseText
    End With
    Dim post As Object, i As Long
    Set post = html.querySelectorAll(".mM5pbd .bRMDJf")
    For i = 0 To post.Length - 1
        Debug.Print post.Item(i).innerHTML
    Next i
    Stop
End Sub
First, I got post.Length of only 20, while I expect about 300.
Second, I can't get the correct link for the picture, as it seems to be Base64 encoded or something similar (I am not sure).
How can I get the real links for the pictures, and get all the links for all the related pictures?
I think one point is solved with:
Set post = html.querySelectorAll("a.VFACy.kGQAp")
For i = 0 To post.Length - 1
    Debug.Print post.Item(i).href
Next i
But how can I get all the links instead of only the 20?
** The links are not totally right; for example, I got this link:
https://www.fool.com/taxes/2018/03/27/are-you-having-a-baby-here-are-the-tax-breaks-you.aspx
While the correct link is
https://g.foolcdn.com/editorial/images/466737/new-parents-holding-newborn-baby-mom-dad-father-mother.jpg
** I tried using IE
Sub TestIE()
    Dim ie As New InternetExplorer
    Dim lastrow As Long
    Dim i As Long
    Dim j As Long
    lastrow = Range("A" & Rows.Count).End(xlUp).Row
    For i = 2 To lastrow
        ie.Visible = True
        ie.navigate "https://www.google.com.eg/search?q=baby&sxsrf=ALeKk01tyfvvxyYjaC0YctjxaY0RlvPnuw:1586804351129&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjB77TtiuboAhUl5uAKHR5KA2wQ_AUoAXoECBQQAw&biw=1280&bih=881"
        While ie.Busy Or ie.readyState < 4: DoEvents: Wend
        'querySelectorAll("a.VFACy.kGQAp")
        Dim post As Object
        Set post = ie.document.querySelectorAll("a.VFACy.kGQAp")
        For j = 0 To post.Length - 1
            Debug.Print post.Item(i).innerHTML
        Next j
    Next
End Sub
But in the results I got the same innerHTML for all of them:
<div class="sMi44c lNHeqe"><div class="WGvvNb" dir="ltr">Baby colic - Wikipedia</div><div class="fxgdke"><span dir="ltr">en.wikipedia.org</span></div>
</div>
Is using querySelectorAll different when dealing with IE?
** Another try
Dim post As Object
Set post = ie.document.querySelectorAll(".bRMDJf img")
Dim r As Long
For j = 0 To post.Length - 1
    r = r + 1
    Cells(r, 1).Value = post.Item(i).getAttribute("src")
Next j
Now I got 100 results, but not the links: the src is the Base64-encoded picture data, and moreover I found the output is the same for all the pictures. I could decode the pictures, but the quality is low, and I got only 100. How can I increase the number of results and get the correct links?

Web Scraping - VBA

I am trying to scrape data from a website without any luck. I managed to navigate through the elements, but I haven't managed to get the information from the last elements. Below is my code; any help would be appreciated.
Option Explicit

Sub Download_Historical_Data()
    Dim IE As InternetExplorer, doc As HTMLDocument
    Dim All_Matches, Match
    Dim All_Champions, Champion
    'Open browser and download data
    Set IE = New InternetExplorer
    With IE
        .Visible = True
        .Navigate ("https://www.scorespro.com/soccer/results/")
        While .Busy Or .readyState < 4: DoEvents: Wend
        Set doc = .document
    End With
    Set All_Champions = doc.getElementById("matches-data").getElementsByClassName("compgrp")
    For Each Champion In All_Champions
        Set All_Matches = Champion.getElementsByTagName("table")
        For Each Match In All_Matches
            If Left(Match.className, 12) = "blocks gteam" Then
                With Match
                    'All the info
                End With
            End If
        Next Match
    Next Champion
    IE.Quit
    Set IE = Nothing
End Sub
Sample on 9/8/19:
Sample on 7/8/19:
Output:
The reason I used 2 different days as samples is that one of them has a game with penalties, and I want to include this as well.
You don't need to automate a browser. If you inspect the network traffic when selecting a date, you will see an XHR request for the info. You can use those details (in fact I shortened it to just the required URL params) to retrieve the page content.
The info is contained in table tag elements. The champion is in tables with class name blockBar; otherwise the info is the row info as seen on the page. In order to leverage querySelector (which is a method of HTMLDocument) to select the sub-table-level elements by class name for each individual table, I stick the individual table html into a surrogate HTML document variable; I then have access to querySelector again, and so can write nice flexible, descriptive CSS selectors to match on elements.
The columns in your output all have nice descriptive class names in the XHR response, so you can use those to determine which column to write to. As the score info risks losing its formatting on output, I use a Select Case statement to test for those CSS selectors and prepend a single quote to preserve the formatting.
For efficiency, I choose to store all results in an array and write them out in one go.
Option Explicit

Public Sub GetMatchInfo()
    Dim headers(), results(), r As Long, c As Long, ws As Worksheet, i As Long
    Dim champion As String, html As HTMLDocument, html2 As HTMLDocument, cssSelectors(), j As Long
    Set html = New HTMLDocument
    Set html2 = New HTMLDocument
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    headers = Array("Date", "Time", "Status", "Champion", "Home Team", "Full Time Score", "Away Team", "Half Time", "Penalties Score")
    cssSelectors = Array(".kick_t_dt", ".kick_t_ko", ".status", "champion", ".home", ".score_link", ".away", ".halftime", ".after_pen")
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.scorespro.com/soccer/ajax-calendar.php?mode=results&date=2019-08-07", False
        .send
        html.body.innerHTML = .responseText
    End With
    Dim tables As Object, selector As String
    Set tables = html.querySelectorAll("table")
    ReDim results(1 To tables.Length, 1 To UBound(headers) + 1)
    For i = 0 To tables.Length - 1
        If tables.item(i).className = "blockBar" Then
            champion = tables.item(i).innerText
        Else
            r = r + 1
            html2.body.innerHTML = tables.item(i).outerHTML
            On Error Resume Next
            For j = LBound(cssSelectors) To UBound(cssSelectors)
                selector = cssSelectors(j)
                Select Case selector
                    Case ".score_link", ".halftime", ".after_pen"
                        results(r, j + 1) = "'" & html2.querySelector(cssSelectors(j)).innerText
                    Case "champion"
                        results(r, j + 1) = champion
                    Case Else
                        results(r, j + 1) = html2.querySelector(cssSelectors(j)).innerText
                End Select
            Next
            On Error GoTo 0
        End If
    Next
    ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
    ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
Sample output:
Using IE
Option Explicit

Public Sub GetMatchInfo()
    Dim headers(), results(), r As Long, c As Long, ws As Worksheet, i As Long
    Dim champion As String, html As HTMLDocument, html2 As HTMLDocument, cssSelectors(), j As Long
    Set html = New HTMLDocument
    Set html2 = New HTMLDocument
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    headers = Array("Date", "Time", "Status", "Champion", "Home Team", "Full Time Score", "Away Team", "Half Time", "Penalties Score")
    cssSelectors = Array(".kick_t_dt", ".kick_t_ko", ".status", "champion", ".home", ".score_link", ".away", ".halftime", ".after_pen")
    With CreateObject("InternetExplorer.Application")
        .Navigate2 "https://www.scorespro.com/soccer/results/"
        While .Busy Or .readyState <> 4: DoEvents: Wend
        Application.Wait Now + TimeSerial(0, 0, 2)
        html.body.innerHTML = .document.body.innerHTML
        .Quit
    End With
    Dim tables As Object, selector As String
    Set tables = html.querySelectorAll("table")
    ReDim results(1 To tables.Length, 1 To UBound(headers) + 1)
    For i = 0 To tables.Length - 1
        If tables.item(i).className = "blockBar" Then
            champion = tables.item(i).innerText
        Else
            r = r + 1
            html2.body.innerHTML = tables.item(i).outerHTML
            On Error Resume Next
            For j = LBound(cssSelectors) To UBound(cssSelectors)
                selector = cssSelectors(j)
                Select Case selector
                    Case ".score_link", ".halftime", ".after_pen"
                        results(r, j + 1) = "'" & html2.querySelector(cssSelectors(j)).innerText
                    Case "champion"
                        results(r, j + 1) = champion
                    Case Else
                        results(r, j + 1) = html2.querySelector(cssSelectors(j)).innerText
                End Select
            Next
            On Error GoTo 0
        End If
    Next
    ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
    ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
I wrote this in WSL (web scraping language), but basically you can edit the JSON to add any other fields (assuming all the football games). Once you have all the data, you can have it emailed to you or sent to your web server.
GOTO www.scorespro.com/soccer/results/ >>
EXTRACT {'time': '.kick_t', 'status':'.status',
'home':'.home.uc', 'score':'.score', 'away':'.away', 'match':'a'} IN table tr
Explanation: it goes to that scores page, then pulls the time, status, home, score, and away fields for each table row via table tr, and finally the match field, which comes from the header-bar table row. It will look like {'time': undefined, ..., 'match': 'Armenia: Premier League'} along with other table-row game schedules like {'time': '2019/8/21', ..., 'match': undefined}. Just merge the JSON objects afterwards.

VBA Scraping div elements

So, I've been trying to scrape data from a website but I simply can't reach my goal...
I'm new to VBA and I've tried to learn the basics in order to understand some code.
So far I have this code, but it only scrapes the data from the first div, and it scrapes all the data into one cell. I need the macro to run through the whole page and scrape all the data that has the class name I input in the code into different cells (e.g. 1st div to cell A1, 2nd div to cell A2, and so on).
Could you help me or give me some pointers on what I'm doing wrong, please?
Thank you!
Code:
Sub BoschRoupa()
    Dim ieObj As InternetExplorer
    Dim htmlEle As IHTMLElement
    Dim i As Integer
    i = 1
    Set ieObj = New InternetExplorer
    ieObj.Visible = False
    ieObj.navigate "https://www.worten.pt/grandes-eletrodomesticos/maquinas-de-roupa/maquinas-de-roupa-ver-todos-marca-BALAY-e-BOSCH-e-SIEMENS?per_page=100"
    Application.Wait Now + TimeValue("00:00:05")
    For Each htmlEle In ieObj.document.getElementsByClassName("w-product__content")(0).getElementsByTagName("div")
        With ActiveSheet
            .Range("A" & i).Value = htmlEle.Children(0).textContent
        End With
        i = i + 1
    Next htmlEle
End Sub
You can use xmlhttp, rather than a browser, and then the following loop to write out all the div info. I would probably be more selective in how I grab only the data of interest, but the following, I hope, is in the spirit of what you asked for.
Option Explicit

Public Sub GetInfo()
    Dim data As Object, i As Long, html As HTMLDocument, r As Long, c As Long, item As Object, div As Object
    Set html = New HTMLDocument '<== VBE > Tools > References > Microsoft HTML Object Library
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.worten.pt/grandes-eletrodomesticos/maquinas-de-roupa/maquinas-de-roupa-ver-todos-marca-BALAY-e-BOSCH-e-SIEMENS?per_page=100", False
        .send
        html.body.innerHTML = .responseText
    End With
    Set data = html.getElementsByClassName("w-product__content")
    For Each item In data
        r = r + 1: c = 1
        For Each div In item.getElementsByTagName("div")
            With ThisWorkbook.Worksheets("Sheet1")
                .Cells(r, c) = div.innerText
            End With
            c = c + 1
        Next
    Next
End Sub

web scraping using excel and VBA

I wrote my VBA code in an Excel sheet as below, but it does not scrape the data for me and I don't know why; please can anyone help me. It gave me a result of "click here to read more" only. I want to scrape the entire data, such as first name, last name, state, zip code, and so on.
Sub extractTablesData()
    Dim IE As Object, obj As Object
    Dim myState As String
    Dim r As Integer, c As Integer, t As Integer
    Dim elemCollection As Object
    Set IE = CreateObject("InternetExplorer.Application")
    myState = InputBox("Enter the city where you wish to work")
    With IE
        .Visible = True
        .navigate ("http://www.funeralhomes.com/go/listing/Search?name=&city=&state=&country=USA&zip=&radius=")
        While IE.readyState <> 4
            DoEvents
        Wend
        For Each obj In IE.document.all.item("state").Options
            If obj.innerText = myState Then
                obj.Selected = True
            End If
        Next obj
        IE.document.getElementsByValue("Search").item.Click
        Do While IE.Busy: DoEvents: Loop
        ThisWorkbook.Sheets("Sheet1").Range("A1:K1500").ClearContents
        Set elemCollection = IE.document.getElementsByTagName("TABLE")
        For t = 0 To (elemCollection.Length - 1)
            For r = 0 To (elemCollection(t).Rows.Length - 1)
                For c = 0 To (elemCollection(t).Rows(r).Cells.Length - 1)
                    ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = elemCollection(t).Rows(r).Cells(c).innerText
                Next c
            Next r
        Next t
    End With
    Set IE = Nothing
End Sub
Using the same URL as the answer already given, you could alternatively select the elements of interest with CSS selectors, and use Split to get just the names and address parts from the text. We can also do away with the browser altogether to get faster results from the first results page.
Business name:
You can get the name with the following selector (using the paid-listing example):
div.paid-listing .listing-title
Address info:
The associated descriptive information can be retrieved with the selector:
div.paid-listing .address-summary
Then, using Split, we can parse this down to just the address information.
Code:
Option Explicit

Public Sub GetTitleAndAddress()
    Dim oHtml As HTMLDocument, nodeList1 As Object, nodeList2 As Object, i As Long
    Const URL As String = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"
    Set oHtml = New HTMLDocument
    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", URL, False
        .send
        oHtml.body.innerHTML = .responseText
    End With
    Set nodeList1 = oHtml.querySelectorAll("div.paid-listing .listing-title")
    Set nodeList2 = oHtml.querySelectorAll("div.paid-listing .address-summary")
    With Worksheets("Sheet3")
        .UsedRange.ClearContents
        For i = 0 To nodeList1.Length - 1
            .Range("A" & i + 1) = nodeList1.Item(i).innerText
            .Range("B" & i + 1) = Split(nodeList2.Item(i).innerText, Chr$(10))(0)
        Next i
    End With
End Sub
Example output:
Yeah, without an API, this can be very tricky at best, and very inconsistent at worst. For now, you can try the script below.
Sub DumpData()
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True
    URL = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"
    'Wait for site to fully load
    IE.Navigate2 URL
    Do While IE.Busy = True
        DoEvents
    Loop
    RowCount = 1
    With Sheets("Sheet1")
        .Cells.ClearContents
        RowCount = 1
        For Each itm In IE.document.all
            If itm.classname Like "*free-listing*" Or itm.classname Like "*paid-listing*" Then
                .Range("A" & RowCount) = itm.classname
                .Range("B" & RowCount) = Left(itm.innertext, 1024)
                RowCount = RowCount + 1
            End If
        Next itm
    End With
End Sub
You probably want some kind of input box to capture the city, state, and radius from the user, or to capture those variables from cells in your worksheet.
Notice that the '%20' is a space character.
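As a minimal sketch of that idea (the Sub name and prompts are illustrative assumptions; the query parameters are taken from the question's original Search URL):

'A sketch only: capture the search criteria and build the query string.
'BuildSearchUrl and the prompts are assumed names, not from the original answer.
Sub BuildSearchUrl()
    Dim city As String, state As String, radius As String, url As String
    city = InputBox("Enter the city where you wish to work")
    state = InputBox("Enter the state")
    radius = InputBox("Enter the search radius")
    url = "http://www.funeralhomes.com/go/listing/Search?name=&city=" & Replace(city, " ", "%20") & _
          "&state=" & Replace(state, " ", "%20") & "&country=USA&zip=&radius=" & radius
    Debug.Print url 'pass this to IE.Navigate2, or use it in an XMLHTTP GET
End Sub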
I got this idea from a friend of mine, Joel, a long time ago. That guy is great!
