Excel Macro to draw thread comments from website into cells

I am trying to store Reddit thread comments in an Excel spreadsheet, but I am having trouble figuring out how to do it. I do not have much experience with using macros to pull data from webpages, so I am not sure how to extract each comment from a specified Reddit thread and place it in its own cell, or whether it is even possible.
This is what I have so far:
Sub getRedditData()
    Dim x As Long, y As Long
    Dim htm As Object
    Set htm = CreateObject("htmlFile")
    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
        .send
        htm.body.innerhtml = .responsetext
    End With
    With htm.getelementbyid("comments")
        Set cellrangex = .Rows(x).Cells.Length - 1
        Set cellrangey = .Rows(x).Cells.Length - 1
        Set cellrange1 = Sheets(1).Cells(x + 1, y + 1).Value
        Set cellrange2 = .Rows(x).Cells(y).innertext
        For x = 0 To cellrangex
            For y = 0 To cellrangey
                cellrange = cellrange2
            Next y
        Next x
    End With
End Sub

You'll really need to analyze the contents of the web page you are scraping with a decent HTML inspector. I would suggest navigating to the page in question in Chrome and pressing F12 to open its developer tools. In the "Elements" tab you can quickly see which HTML is producing which part of the page (keep the page and the developer tools open side by side).
You'll notice as you head into the comments that the text of each comment is inside a <p> tag and each <p> tag is inside a <div>. We are looking for patterns, so this is a good start.
You'll also notice that each one of those <div> tags has a class of md.
So... let's load all of the page's <div> tags into an object and then look for the ones whose className contains "md":
Sub getRedditData()
    Dim x As Long, y As Long
    Dim htm As Object
    Dim divElements As Object, divElement As Object
    Set htm = CreateObject("htmlFile")
    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
        .send
        htm.body.innerHTML = .responseText
    End With
    Set divElements = htm.getElementsByTagName("div")
    For Each divElement In divElements
        If InStr(1, divElement.className, "md") > 0 Then
            'Print contents to the Immediate window for debugging
            '(View >> Immediate Window to make sure it's visible in your VBE)
            Debug.Print divElement.innerText
        End If
    Next
End Sub
With that, you'll see all of the comments dumped into the Immediate window (go to View >> Immediate Window) so you can check the debug output.
After skipping around the nodes it looks like you can navigate up a couple of elements and back down the tree to get the username:
Sub getRedditData()
    Dim x As Long, y As Long
    Dim htm As Object
    Dim divElements As Object, divElement As Object, commentEntry As Object
    Set htm = CreateObject("htmlFile")
    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
        .send
        htm.body.innerHTML = .responseText
    End With
    Set divElements = htm.getElementsByTagName("div")
    On Error Resume Next
    For Each divElement In divElements
        If InStr(1, divElement.className, "md") > 0 And InStr(1, divElement.className, "md-container") = 0 Then
            Set commentEntry = divElement.parentNode.parentNode.parentNode
            'Print the name and the comment
            Debug.Print commentEntry.FirstChild.FirstChild.NextSibling.innerText & ":", divElement.innerText
        End If
    Next
End Sub
To print this out to the sheet, write to a cell instead of using Debug.Print and the Immediate window. Something like:
Sub getRedditData()
    Dim x As Long, y As Long
    Dim htm As Object
    Dim divElements As Object, divElement As Object, commentEntry As Object
    Dim ws As Worksheet, wsCell As Long
    'Set the worksheet to print to and the first row to start printing
    Set ws = Sheets("Sheet1")
    wsCell = 1
    Set htm = CreateObject("htmlFile")
    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
        .send
        htm.body.innerHTML = .responseText
    End With
    Set divElements = htm.getElementsByTagName("div")
    On Error Resume Next
    For Each divElement In divElements
        If InStr(1, divElement.className, "md") > 0 And InStr(1, divElement.className, "md-container") = 0 Then
            Set commentEntry = divElement.parentNode.parentNode.parentNode
            'Write the name and the comment to columns 1 and 2 of ws
            ws.Cells(wsCell, 1).Value = commentEntry.FirstChild.FirstChild.NextSibling.innerText
            ws.Cells(wsCell, 2).Value = divElement.innerText
            'Move to the next row
            wsCell = wsCell + 1
        End If
    Next
End Sub
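If you want to reuse this for other threads, one option is to pass the thread URL and target sheet in as parameters and call the routine from a small wrapper. This is just a sketch built on the code above; it assumes the same old-Reddit page structure, and the wrapper name and target sheet are arbitrary:
Sub getRedditComments(ByVal threadUrl As String, ByVal ws As Worksheet)
    Dim htm As Object, divElement As Object, commentEntry As Object
    Dim wsCell As Long
    wsCell = 1
    Set htm = CreateObject("htmlFile")
    With CreateObject("msxml2.xmlhttp")
        .Open "GET", threadUrl, False
        .send
        htm.body.innerHTML = .responseText
    End With
    On Error Resume Next
    For Each divElement In htm.getElementsByTagName("div")
        If InStr(1, divElement.className, "md") > 0 And InStr(1, divElement.className, "md-container") = 0 Then
            Set commentEntry = divElement.parentNode.parentNode.parentNode
            ws.Cells(wsCell, 1).Value = commentEntry.FirstChild.FirstChild.NextSibling.innerText
            ws.Cells(wsCell, 2).Value = divElement.innerText
            wsCell = wsCell + 1
        End If
    Next
End Sub

Sub runExample()
    'Hypothetical wrapper: scrape one thread into Sheet1
    getRedditComments "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", Sheets("Sheet1")
End Sub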

Related

How to skip a row in Excel with missing html tag using VBA

There are 15 objects listed on this website, each with a link under its photo, except the 6th one, which has none. When my code extracts and transfers the content, the missing href is not skipped: the 14 hrefs are listed one below the other in Excel (the 6th cell should stay empty or say "No document"), so everything after the gap shifts up by one row and the counts no longer match (14 vs. 15), which causes an error. Unfortunately I have to keep my code structure and just need a loop or condition to complete it. Does anyone have any ideas? Thanks.
My incomplete code:
Public Sub GetData()
    Dim html As New HTMLDocument
    Dim elmt01 As Object, elmt02 As Object
    Dim y As Long
    Dim xURL As String
    Set html = New MSHTML.HTMLDocument
    xURL = "https://immobilienpool.de/suche/immobilien?page=1"
    With CreateObject("MSXML2.XMLHTTP.6.0")
        .Open "GET", xURL, False
        .send
        html.body.innerHTML = .responseText
    End With
    Set elmt01 = html.querySelectorAll("li[class*='contentBox']")    '15 items
    Set elmt02 = html.querySelectorAll("li a[title*='zusätzliche']") '14 hrefs
    For y = 0 To elmt01.Length - 1
        If InStr(elmt02, "pdf") Then 'better: If elmt02 exists in elmt01 then...
            ActiveSheet.Cells(y + 1, 2) = elmt02.Item(y).href
        Else
            ActiveSheet.Cells(y + 1, 2) = "No document"
        End If
    Next
End Sub
The following script should solve the issue you are having. I had to modify your code so that the missing link is handled and the row stays in place. I hope you can work with this version:
Public Sub GetData()
    Dim Html As HTMLDocument, HTMLDoc As HTMLDocument
    Dim oPdfLink As Object, xURL As String, I As Long
    Set Html = New MSHTML.HTMLDocument
    Set HTMLDoc = New MSHTML.HTMLDocument
    xURL = "https://immobilienpool.de/suche/immobilien?page=1"
    With CreateObject("MSXML2.XMLHTTP.6.0")
        .Open "GET", xURL, False
        .send
        Html.body.innerHTML = .responseText
    End With
    With Html.querySelectorAll("li[class*='contentBox']")
        For I = 0 To .Length - 1
            'Load each listing into its own document so we can query inside it
            HTMLDoc.body.innerHTML = .item(I).outerHTML
            Set oPdfLink = HTMLDoc.querySelector("a[title*='zusätzliche']")
            If Not oPdfLink Is Nothing Then
                ActiveSheet.Cells(I + 1, 2) = oPdfLink.href
            Else
                ActiveSheet.Cells(I + 1, 2) = "No document"
            End If
        Next I
    End With
End Sub
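If you later need more than the first results page, the same pattern extends naturally: loop over the page query parameter and keep a running row counter. A minimal sketch, assuming the other pages use the same markup; the pageCount of 3 is just an example value:
Public Sub GetDataAllPages()
    Dim Html As HTMLDocument, HTMLDoc As HTMLDocument
    Dim oPdfLink As Object, I As Long, page As Long, r As Long
    Const pageCount As Long = 3 'hypothetical number of result pages to fetch
    Set Html = New MSHTML.HTMLDocument
    Set HTMLDoc = New MSHTML.HTMLDocument
    r = 1
    For page = 1 To pageCount
        With CreateObject("MSXML2.XMLHTTP.6.0")
            .Open "GET", "https://immobilienpool.de/suche/immobilien?page=" & page, False
            .send
            Html.body.innerHTML = .responseText
        End With
        With Html.querySelectorAll("li[class*='contentBox']")
            For I = 0 To .Length - 1
                HTMLDoc.body.innerHTML = .item(I).outerHTML
                Set oPdfLink = HTMLDoc.querySelector("a[title*='zusätzliche']")
                If Not oPdfLink Is Nothing Then
                    ActiveSheet.Cells(r, 2) = oPdfLink.href
                Else
                    ActiveSheet.Cells(r, 2) = "No document"
                End If
                r = r + 1
            Next I
        End With
    Next page
End Sub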

VBA Scraping div elements

So, I've been trying to scrape data from a website but I simply can't reach my goal...
I'm new to VBA and I've tried to learn the basics in order to understand some code.
So far I have this code, but it only scrapes the data from the 1st div and puts it all into one cell. I need the macro to run through the whole page and scrape all the data with the class name I specify in the code into different cells (e.g. 1st div to cell A1, 2nd div to cell A2, and so on).
Could you help me or give me some pointers on what I'm doing wrong, please?
Thank you!
Code:
Sub BoschRoupa()
    Dim ieObj As InternetExplorer
    Dim htmlEle As IHTMLElement
    Dim i As Integer
    i = 1
    Set ieObj = New InternetExplorer
    ieObj.Visible = False
    ieObj.navigate "https://www.worten.pt/grandes-eletrodomesticos/maquinas-de-roupa/maquinas-de-roupa-ver-todos-marca-BALAY-e-BOSCH-e-SIEMENS?per_page=100"
    Application.Wait Now + TimeValue("00:00:05")
    For Each htmlEle In ieObj.document.getElementsByClassName("w-product__content")(0).getElementsByTagName("div")
        With ActiveSheet
            .Range("A" & i).Value = htmlEle.Children(0).textContent
        End With
        i = i + 1
    Next htmlEle
End Sub
You can use xmlhttp, rather than a browser, and then the following loop to write out all the div info. I would probably be more selective about grabbing only the data of interest, but the following is, I hope, in the spirit of what you asked for.
Option Explicit
Public Sub GetInfo()
    Dim data As Object, i As Long, html As HTMLDocument, r As Long, c As Long, item As Object, div As Object
    Set html = New HTMLDocument '<== VBE > Tools > References > Microsoft HTML Object Library
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.worten.pt/grandes-eletrodomesticos/maquinas-de-roupa/maquinas-de-roupa-ver-todos-marca-BALAY-e-BOSCH-e-SIEMENS?per_page=100", False
        .send
        html.body.innerHTML = .responseText
    End With
    Set data = html.getElementsByClassName("w-product__content")
    For Each item In data
        r = r + 1: c = 1
        For Each div In item.getElementsByTagName("div")
            With ThisWorkbook.Worksheets("Sheet1")
                .Cells(r, c) = div.innerText
            End With
            c = c + 1
        Next
    Next
End Sub
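If you only want one value per product written down column A, as the question describes, you can take just the first div of each product block instead of writing every div across the row. A sketch along those lines, assuming the first div inside w-product__content holds the text you want (verify that against the live page):
Public Sub GetFirstDivPerProduct()
    Dim data As Object, item As Object, r As Long, html As HTMLDocument
    Set html = New HTMLDocument 'requires the Microsoft HTML Object Library reference
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.worten.pt/grandes-eletrodomesticos/maquinas-de-roupa/maquinas-de-roupa-ver-todos-marca-BALAY-e-BOSCH-e-SIEMENS?per_page=100", False
        .send
        html.body.innerHTML = .responseText
    End With
    Set data = html.getElementsByClassName("w-product__content")
    For Each item In data
        r = r + 1
        'Assumption: the first <div> inside each product block carries the text of interest
        ThisWorkbook.Worksheets("Sheet1").Cells(r, 1) = item.getElementsByTagName("div")(0).innerText
    Next
End Sub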

Object does not support this property or method, while parsing HTML document

I want to scrape every post heading from this blog. I am using the code below, but it gives me the error "Run-time error 438: Object doesn't support this property or method" on the line
Cells(i, 1).Value = ele.getElementsByClassName("entry-title")(0).getElementsByTagName("a")(0).innerText
The code is:
Private Sub CommandButton1_Click()
    Dim bot As Object
    Dim doc As New HTMLDocument
    Dim ele As HTMLElementCollection
    Dim i As Long
    Set bot = CreateObject("MSXML2.XMLHTTP")
    bot.Open "GET", "http://themakeupblogger.com/makeup/", False
    bot.send
    doc.body.innerHTML = bot.responseText
    For Each ele In doc.getElementsByTagName("article")
        i = Cells(Rows.Count, 1).End(xlUp).Row + 1
        Cells(i, 1).Value = ele.getElementsByClassName("entry-title")(0).getElementsByTagName("a")(0).innerText
    Next ele
End Sub
Give this a shot and get all the titles you are after.
Sub demo()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim r As Long, elem As Object
    With http
        .Open "GET", "http://themakeupblogger.com/makeup/", False
        .send
        html.body.innerHTML = .responseText
    End With
    For Each elem In html.getElementsByClassName("entry-title")
        With elem.getElementsByTagName("a")
            If .Length Then r = r + 1: Cells(r, 1) = .Item(0).innerText
        End With
    Next elem
End Sub
References to add in the VBE (Tools > References):
1. Microsoft XML, v6.0
2. Microsoft HTML Object Library
Partial results:
4 High-Coverage Foundations That Might As Well Be Skincare
10 Memorial Day Beauty Essentials That Belong In Your Beach Bag
Don’t Get Married Without These Wedding Day Makeup Tips (Courtesy of a Makeup Artist)
To get the articles from that page you can do something like:
Sub demo()
    Dim http As New InternetExplorer, html As New HTMLDocument
    Dim r As Long, elem As Object
    With http
        .Visible = False
        .navigate "http://themakeupblogger.com/makeup/"
        Do Until .readyState = READYSTATE_COMPLETE: DoEvents: Loop
        Set html = .document
    End With
    For Each elem In html.getElementsByTagName("article")
        With elem.getElementsByTagName("h1")
            If .Length Then r = r + 1: Cells(r, 1) = .Item(0).getElementsByTagName("a")(0).innerText
        End With
        With elem.getElementsByTagName("div")(3).getElementsByTagName("p")
            If .Length Then Cells(r, 2) = .Item(0).innerText
        End With
    Next elem
End Sub
This time the references you should add are:
1. Microsoft Internet Controls
2. Microsoft HTML Object Library
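If you would rather not automate a browser for this second approach either, the same element walk can usually be run against the XMLHTTP response, provided the articles are present in the static HTML rather than built by JavaScript. A sketch of that combination (same selectors as above, so the same caveat applies about the div(3) index being fragile):
Sub demoNoBrowser()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim r As Long, elem As Object
    With http
        .Open "GET", "http://themakeupblogger.com/makeup/", False
        .send
        html.body.innerHTML = .responseText
    End With
    For Each elem In html.getElementsByTagName("article")
        With elem.getElementsByTagName("h1")
            If .Length Then r = r + 1: Cells(r, 1) = .Item(0).getElementsByTagName("a")(0).innerText
        End With
        'Assumption carried over from the browser version: the 4th div holds the teaser paragraph
        With elem.getElementsByTagName("div")(3).getElementsByTagName("p")
            If .Length Then Cells(r, 2) = .Item(0).innerText
        End With
    Next elem
End Sub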

Excel VBA code to get link and click it

This is a screenshot of a link which I want VBA Excel to click:
I am using this code in VBA Excel after navigating to the required page but it's not getting that link which I have shown above in the picture.
Set Alllinks = objIE.document.getallelementsbytagname("a")
For Each link In Alllinks
    'MsgBox link.innertext & " - " & link.href
    If InStr(link.innerText, "ABERCROMBIE JOE R") > 0 Then
        link.Click
        Exit For
    End If
Next link
Modify your code like this:
Dim httpObject As Object
Set httpObject = CreateObject("MSXML2.XMLHTTP")
Dim doc As Object
Set doc = CreateObject("htmlfile")
Dim links As Variant
With httpObject
    'Request the search results page directly instead of navigating and clicking
    .Open "GET", "http://www.deltacomputersystems.com/cgi-lra2/LRMCGI01?HTMCNTY=AL39&HTMBASE=C&HTMSEARCH=BEGIN&HTMNAME=ABERCROMBIE+JOE+R&HTMADDRNUMBER=&HTMADDRSTREET=&HTMPARCEL1=&HTMPARCEL2=&HTMPARCEL3=&HTMPARCEL4=&HTMPARCEL5=&HTMPARCEL6=&HTMPARCEL7=&HTMPARCEL8=&HTMPPIN=&HTMSUBMIT=Submit", False
    .send
    Do Until httpObject.ReadyState = 4
        DoEvents
    Loop
    'Parse the response into an HTML document and read the anchors from it
    doc.body.innerhtml = .responseText
    Set links = doc.getElementsByTagName("a")
    MsgBox (links(0).href)
End With
The message box then shows the first link's href from the results page.
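If you need the specific "ABERCROMBIE JOE R" link rather than the first anchor, you can keep the filter from your original code and read its href instead of clicking it. A short sketch continuing from the objects created above (there is no browser here, so there is nothing to click; you would issue a second GET against the href if you also need that page):
Dim link As Object, targetHref As String
For Each link In doc.getElementsByTagName("a")
    If InStr(link.innerText, "ABERCROMBIE JOE R") > 0 Then
        targetHref = link.href
        Exit For
    End If
Next link
If Len(targetHref) > 0 Then
    MsgBox "Found link: " & targetHref
    'To load that page as well, repeat the GET/responseText steps with targetHref
End If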

web scraping using excel and VBA

I wrote my VBA code in an Excel sheet as below, but it does not scrape the data for me and I don't know why. Please can anyone help me? It only gives me the result "click here to read more", and I want to scrape the entire data set, such as first name, last name, state, zip code and so on.
Sub extractTablesData()
    Dim IE As Object, obj As Object
    Dim myState As String
    Dim r As Integer, c As Integer, t As Integer
    Dim elemCollection As Object
    Set IE = CreateObject("InternetExplorer.Application")
    myState = InputBox("Enter the city where you wish to work")
    With IE
        .Visible = True
        .navigate ("http://www.funeralhomes.com/go/listing/Search?name=&city=&state=&country=USA&zip=&radius=")
        While IE.readyState <> 4
            DoEvents
        Wend
        For Each obj In IE.document.all.item("state").Options
            If obj.innerText = myState Then
                obj.Selected = True
            End If
        Next obj
        IE.document.getElementsByValue("Search").item.Click
        Do While IE.Busy: DoEvents: Loop
        ThisWorkbook.Sheets("Sheet1").Range("A1:K1500").ClearContents
        Set elemCollection = IE.document.getElementsByTagName("TABLE")
        For t = 0 To (elemCollection.Length - 1)
            For r = 0 To (elemCollection(t).Rows.Length - 1)
                For c = 0 To (elemCollection(t).Rows(r).Cells.Length - 1)
                    ThisWorkbook.Worksheets(1).Cells(r + 1, c + 1) = elemCollection(t).Rows(r).Cells(c).innerText
                Next c
            Next r
        Next t
    End With
    Set IE = Nothing
End Sub
Using the same URL as in the answer already given, you could alternatively use CSS selectors to get the elements of interest, and use Split to pull just the names and address parts out of the text. We can also do away with the browser altogether, which gets faster results for the first results page.
Business name:
You can get the name with the following selector (using paid listing example):
div.paid-listing .listing-title
Address info:
The associated descriptive information can be retrieved with the selector:
div.paid-listing .address-summary
And then using split we can parse this into just the address information.
Code:
Option Explicit
Public Sub GetTitleAndAddress()
    Dim oHtml As HTMLDocument, nodeList1 As Object, nodeList2 As Object, i As Long
    Const URL As String = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"
    Set oHtml = New HTMLDocument
    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", URL, False
        .send
        oHtml.body.innerHTML = .responseText
    End With
    Set nodeList1 = oHtml.querySelectorAll("div.paid-listing .listing-title")
    Set nodeList2 = oHtml.querySelectorAll("div.paid-listing .address-summary")
    With Worksheets("Sheet3")
        .UsedRange.ClearContents
        For i = 0 To nodeList1.Length - 1
            .Range("A" & i + 1) = nodeList1.Item(i).innerText
            'Split on the line feed and keep the first line of the address block
            .Range("B" & i + 1) = Split(nodeList2.Item(i).innerText, Chr$(10))(0)
        Next i
    End With
End Sub
Yeah, without an API, this can be very tricky at best, and very inconsistent at worst. For now, you can try the script below.
Sub DumpData()
    Dim IE As Object, URL As String, RowCount As Long, itm As Object
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True
    URL = "http://www.funeralhomes.com/go/listing/ShowListing/USA/New%20York/New%20York"
    'Wait for the site to fully load
    IE.Navigate2 URL
    Do While IE.Busy = True
        DoEvents
    Loop
    With Sheets("Sheet1")
        .Cells.ClearContents
        RowCount = 1
        For Each itm In IE.document.all
            If itm.className Like "*free-listing*" Or itm.className Like "*paid-listing*" Then
                .Range("A" & RowCount) = itm.className
                .Range("B" & RowCount) = Left(itm.innerText, 1024)
                RowCount = RowCount + 1
            End If
        Next itm
    End With
End Sub
You probably want some kind of input box to capture the city, state, and radius from the user, or to read those variables from cells in your worksheet.
Notice that '%20' is the URL-encoded space character.
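For example, here is a minimal sketch of capturing those values and building the listing URL; it assumes the same /USA/State/City URL pattern used above, which you should verify against the site:
Sub BuildListingUrl()
    Dim city As String, state As String, url As String
    city = InputBox("Enter the city where you wish to search")
    state = InputBox("Enter the state")
    'Encode spaces as %20, as noted above
    url = "http://www.funeralhomes.com/go/listing/ShowListing/USA/" & _
          Replace(state, " ", "%20") & "/" & Replace(city, " ", "%20")
    MsgBox url 'pass this to the scraping routine instead of the hard-coded URL
End Sub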
I got this idea from a friend of mine, Joel, a long time ago. That guy is great!
