Scrape an innertext from Wikipedia's infobox - excel

I'm doing a project in which a command button should extract some info from a webpage.
I chose to get the 'Confirmed Cases' value from Wikipedia's infobox for the COVID-19 pandemic, but I can't seem to write the correct "getElementsBy.." chain to extract the info.
Private Sub AccesswebsitePAN_Click()
WB_PAN.Navigate ("https://en.wikipedia.org/wiki/COVID-19_pandemic")
End Sub
Private Sub GetinfoPAN_Click()
Cells(2, 1).Value = WB_PAN.Document.getElementsById("mw-content-text").getElementsByClassName("mw-parser-output").getElementsByTagName("table").getElementsByTagName("tbody").getElementsByTagName("tr")(12).getElementsByTagName("td")(0).Innertext
End Sub
Private Sub WB_PAN_StatusTextChange(ByVal Text As String)
End Sub

Since an ID is a unique locator, you can try using one within your script to fetch the result.
Option Explicit
Sub GetCaseCount()
Const Url$ = "https://en.wikipedia.org/wiki/COVID-19_pandemic"
Dim Http As Object, elem As Object, S$
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", Url, False
.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
.send
S = .responseText
End With
With CreateObject("htmlfile")
.body.innerHTML = S
Set elem = .getElementById("cite_ref-JHU_ticker_5-0")
Debug.Print elem.ParentNode.innerText
End With
End Sub
The script should work without adding any library reference, because it creates its objects with CreateObject (late binding).

At several stages you have neglected to index into the collections returned by the getElementsBy... methods. Also, the method is getElementById (singular), not getElementsById. I might go for a shorter, more readable path that selects the table by its class (using querySelector) and then uses the .rows collection.
Your way:
Debug.Print WB_PAN.Document.getElementById("mw-content-text").getElementsByClassName("mw-parser-output")(0).getElementsByTagName("table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")(18).getElementsByTagName("td")(0).Innertext
Alternative:
Option Explicit
Public Sub CovidCases()
Dim ie As SHDocVw.InternetExplorer
Set ie = New SHDocVw.InternetExplorer
With ie
.Visible = True
.Navigate2 "https://en.wikipedia.org/wiki/COVID-19_pandemic"
While .Busy Or .ReadyState <> READYSTATE_COMPLETE: DoEvents: Wend
Debug.Print .Document.querySelector(".infobox").Rows(12).innertext
Debug.Print .Document.querySelector(".infobox").Rows(12).firstchild.nextsibling.innertext
.Quit
End With
End Sub
Slightly cleaner, using an approach from user4596341, is to clear the inner HTML of the node that adds the trailing [4] reference before reading the case count, which gives a more readable path:
Option Explicit
Public Sub CovidCases()
Dim ie As SHDocVw.InternetExplorer
Set ie = New SHDocVw.InternetExplorer
With ie
.Visible = True
.Navigate2 "https://en.wikipedia.org/wiki/COVID-19_pandemic"
While .Busy Or .ReadyState <> READYSTATE_COMPLETE: DoEvents: Wend
.Document.querySelector(".infobox tr:nth-of-type(13) sup").innerHTML = vbNullString
Debug.Print .Document.querySelector(".infobox").Rows(12).Cells(1).innertext
.Quit
End With
End Sub
Seeing @Sim's answer reminded me that you can step back from the reference node to its preceding sibling to get just the case count, which is even cleaner:
Option Explicit
Public Sub CovidCases()
Dim ie As SHDocVw.InternetExplorer
Set ie = New SHDocVw.InternetExplorer
With ie
.Visible = True
.Navigate2 "https://en.wikipedia.org/wiki/COVID-19_pandemic"
While .Busy Or .ReadyState <> READYSTATE_COMPLETE: DoEvents: Wend
Debug.Print .Document.querySelector(".infobox tr:nth-of-type(13) sup").PreviousSibling.NodeValue
.Quit
End With
End Sub

Related

Get custom element via screenscraping with VBA

I would like to screen-scrape some prices from Yahoo Finance for some stocks in my Excel sheet.
My approach is to use:
Function Scrape()
Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")
With appIE
.Navigate "https://de.finance.yahoo.com/quote/TSLA"
.Visible = True
End With
Do While appIE.Busy
DoEvents
Loop
Set data = appIE.document.getElementById("data-reactid") ' this is the point where I'm stuck
End Function
The question I have is how to get the custom elements such as:
<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="32">1.025,05</span>
The site seems to use a reactid for every element, which should make it easy to pinpoint elements. How would I go about doing that for the above example, data-reactid="32"?
Thanks
You can try using XHR, because the content you are looking for is available in the page source. This is one efficient way to do it:
Public Sub GetPrice()
Const Url$ = "https://de.finance.yahoo.com/quote/TSLA"
Dim S$, itemPrice$
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", Url, False
.send
S = .responseText
End With
With CreateObject("HTMLFile")
.write S
itemPrice = .getElementById("quote-market-notice").ParentNode.FirstChild.innerText
MsgBox itemPrice
End With
End Sub

VBA HTML elements to Excel

I am working on code that uses Excel VBA to navigate to a website and copy some values to Excel.
I can open the website and navigate, but I can't save the "Precipitation" values to the Excel sheet.
Sub accuweather()
Dim ie As InternetExplorer
Dim pagePiece As Object
Dim webpage As HTMLDocument
Set ie = New InternetExplorer
ie.Visible = True
ie.navigate ("http://www.accuweather.com/en/pt/abadia/869773/daily-weather-forecast/869773?day=2")
Do While ie.readyState = 4: DoEvents: Loop
Do Until ie.readyState = 4: DoEvents: Loop
While ie.Busy
DoEvents
Wend
Set webpage = ie.document
Set mtbl = webpage.getElementsByTagName("details-card card panel details allow-wrap")
Set table_data = mtbl.getElementsByTagName("div")(1)
For itemNum = 1 To 240
For childNum = 0 To 5
Cells(itemNum, childNum + 1) = table_data.Item(itemNum).Children(childNum).innerText
Next childNum
Next itemNum
ie.Quit
Set ie = Nothing
End Sub
The method you are using is getElementsByTagName, but the argument you pass is a multi-valued class, so the correct method would be getElementsByClassName.
However, you don't need the browser, as that content is static; you can use a faster XMLHTTP request and target a single class, which is more robust and faster.
This
html.querySelectorAll(".list")
is retrieving the two parent nodes which have the various p tag children. The first child in both cases
.Item(i).FirstChild
is the precipitation info.
Option Explicit
Public Sub GetPrecipitationValues()
Dim html As MSHTML.HTMLDocument, i As Long
Set html = New MSHTML.HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.accuweather.com/en/pt/abadia/869773/daily-weather-forecast/869773?day=2", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
End With
With html.querySelectorAll(".list")
For i = 0 To .Length - 1
Debug.Print .Item(i).FirstChild.innerText
Next
End With
End Sub

Why does my code to scrape text using VBA work in Debug only

I have written some code to scrape specific dates from Google's patent website. After reviewing lots of examples I figured out the getElementsByClassName that gets the date I need. The code below works when I step through in debug mode and generates the desired MsgBox. But when I run it, it gives me "Run-time error '91': Object variable or With block variable not set."
I have added delays wherever I thought that might be an issue. I have also disassociated the code from any interaction with the Excel spreadsheet where I would ultimately put the date, just to make it as simple as possible. I've also copied the code from the original spreadsheet to a new blank one, but same issue.
Any help would be appreciated.
Sub Get_Date()
Dim ie As InternetExplorer
Dim sURL As String
Dim strGrant As Variant
Set ie = New InternetExplorer
sURL = "https://patents.google.com/patent/US6816842B1/en?oq=6816842"
ie.navigate sURL
ie.Visible = False
Do While ie.Busy Or ie.ReadyState < 4
DoEvents
Loop
strGrant = ie.document.getElementsByClassName("granted style-scope application-timeline")(0).innerText
Do While ie.Busy Or ie.ReadyState < 4
DoEvents
Loop
MsgBox strGrant
ie.Quit
End Sub
It's likely a timing issue as per my comment. That's dealt with in other answers to similar questions. Main things to consider are:
Use proper page load waits: While IE.Busy Or ie.readyState < 4: DoEvents: Wend
Possibly a timed loop that attempts to set the element to a variable and then tests whether it was set.
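The timed-loop idea could look roughly like the following. This is a minimal sketch, not a definitive implementation: WaitForElement is a hypothetical helper name, the 10-second default is an arbitrary assumption, and doc is assumed to be the document of an already navigated browser (for example ie.document).

```vba
' Sketch only: WaitForElement is a hypothetical helper, not part of any library.
' It polls the document until an element with the given class exists
' or the timeout elapses.
Public Function WaitForElement(ByVal doc As Object, ByVal className As String, _
                               Optional ByVal timeoutSecs As Long = 10) As Object
    Dim started As Date, nodes As Object
    started = Now
    Do
        DoEvents                                   ' let the browser keep loading
        Set nodes = doc.getElementsByClassName(className)
        If nodes.Length > 0 Then
            Set WaitForElement = nodes.Item(0)     ' found: return the first match
            Exit Function
        End If
    Loop While DateDiff("s", started, Now) < timeoutSecs
    ' Timed out: the function returns Nothing
End Function
```

In the question's code, the direct innerText read would then become a call such as Set el = WaitForElement(ie.document, "granted style-scope application-timeline"), followed by a check with If Not el Is Nothing before reading el.innerText.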
Alternatively, and this is a bit of a punt, it seems that all granted dates are the same as the publication dates (patent publication date). If this is true, you can use XHR to get the publication date:
Option Explicit
Public Sub GetDates()
Dim html As HTMLDocument, i As Long, patents()
patents = Array("US7724240", "US6876312", "US8259073", "US7523862", "US6816842B1")
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
For i = LBound(patents) To UBound(patents)
.Open "GET", "https://patents.google.com/patent/" & patents(i) & "/en?oq=" & patents(i), False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
html.body.innerHTML = .responseText
If html.querySelectorAll("[itemprop=publicationDate]").length > 0 Then
Debug.Print html.querySelector("[itemprop=publicationDate]").DateTime
End If
Next
End With
End Sub

How to loop to download Excel files from href (when the href is the same for all files)

I am trying to download Excel files through their href links. However, I am not able to download all of them with a loop (the href links all look alike). What can be done in this case? Once the files are downloaded, how do I save each one to a local folder on the C: drive?
Here is an example of the relevant HTML href links:
<td class="tenderlink">
<a href='/Documents/ProcurementDisposal/20190419102042818.xls'
target="_blank">View</a></td>
Code:
Sub download()
'Application.ScreenUpdating = False
Dim ie As InternetExplorer
Dim ele As Object
Set ie = New InternetExplorer
With ie
.Visible = True
.Navigate2 "http://www.nafed-india.com/Home/ProcDispoDetails"
While .Busy Or .readyState < 4: DoEvents: Wend
End With
For Each ele In ie.document.querySelector("a[href^='/Documents/ProcurementDisposal']").Click
Next ele
End Sub
Try something like this:
Sub download()
Dim ie As InternetExplorer
Dim el As Object, els As Object
Set ie = New InternetExplorer
With ie
.Visible = True
.Navigate2 "http://www.nafed-india.com/Home/ProcDispoDetails"
While .Busy Or .readyState < 4: DoEvents: Wend
End With
'Note: querySelectorAll unless you only want one element
Set els = ie.document.querySelectorAll("a[href^='/Documents/ProcurementDisposal']")
For Each el In els
Debug.Print el.href '<< pass this to Workbooks.Open() then you can save the
' workbook where you need it
Next
End Sub
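For the second part of the question (saving each file locally), one option is to skip opening the workbook and write the downloaded bytes straight to disk. The following is a sketch under assumptions: SaveUrlToFile and FileNameFromUrl are hypothetical helper names, the target folder is whatever you pass in, and the server is assumed to return the file for a plain GET request.

```vba
' Sketch only: both procedures are hypothetical helpers.
' Returns the part of a URL after the last "/", e.g. "20190419102042818.xls".
Public Function FileNameFromUrl(ByVal fileUrl As String) As String
    FileNameFromUrl = Mid$(fileUrl, InStrRev(fileUrl, "/") + 1)
End Function

' Downloads fileUrl and saves it into targetFolder (which must already exist
' and end with a backslash), keeping the file name from the URL.
Public Sub SaveUrlToFile(ByVal fileUrl As String, ByVal targetFolder As String)
    Dim http As Object, stream As Object
    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", fileUrl, False
    http.send
    If http.Status <> 200 Then Exit Sub           ' skip anything that failed
    Set stream = CreateObject("ADODB.Stream")
    With stream
        .Type = 1                                 ' 1 = adTypeBinary
        .Open
        .Write http.responseBody
        .SaveToFile targetFolder & FileNameFromUrl(fileUrl), 2   ' 2 = overwrite
        .Close
    End With
End Sub
```

Inside the For Each loop above, the Debug.Print line could then be replaced by a call such as SaveUrlToFile el.href, "C:\Temp\", where the folder name is only an example.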

getElementsBy() extract text

I'm really new to VBA and I've been trying to get the value below the Column "Impuesto".
I'm getting error 438. I still don't quite understand how to refer to a certain part of the web page.
Sub extract()
Dim myIE As Object
Dim myIEDoc As Object
Dim element As IHTMLElement
Set myIE = CreateObject("InternetExplorer.Application")
myIE.Visible = False
myIE.navigate "https://zonasegura1.bn.com.pe/TipoCambio/"
While myIE.Busy
DoEvents
Wend
Set myIEDoc = myIE.document
Range("B1") = myIEDoc.getElementsByID("movimiento")(0).getElementsByTagName("span")
End Sub
You need getElementsByClassName(), not getElementsByID, since movimiento is a class name in <li class="movimiento bg"> Impuesto </li>. Read the text of the matched element with .innerText:
Range("B1") = myIEDoc.getElementsByClassName("movimiento")(0).getElementsByClassName("l2 valor")(0).innerText
Edit:
Look at the tag. If you are matching on the tag name, as in <li>..</li>, use getElementsByTagName("li").
If the tag contains an id, as in <li id="movimiento">..</li>, use getElementById("movimiento").
If the tag contains a class, as in <li class="movimiento">..</li>, use getElementsByClassName("movimiento").
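The three cases above can be sketched side by side. This is illustrative only: ShowLookups is a hypothetical name, doc is assumed to be an already loaded document such as myIE.document, and the id lookup only returns something if the page actually has an element with that id.

```vba
' Sketch only: ShowLookups is a hypothetical procedure for illustration.
Public Sub ShowLookups(ByVal doc As Object)
    Dim byTag As Object, byId As Object, byClass As Object

    Set byTag = doc.getElementsByTagName("li")              ' collection: index into it
    Set byId = doc.getElementById("movimiento")             ' single element or Nothing
    Set byClass = doc.getElementsByClassName("movimiento")  ' collection: index into it

    If byTag.Length > 0 Then Debug.Print byTag.Item(0).innerText
    If Not byId Is Nothing Then Debug.Print byId.innerText
    If byClass.Length > 0 Then Debug.Print byClass.Item(0).innerText
End Sub
```

The two getElementsBy... calls return collections, which is why they are indexed with Item(0), while getElementById returns at most one element.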
Try the below script. It should fetch you the data you are after. When the execution is done, you should find the value in Range("A1") in your spreadsheet.
Sub Get_Quote()
Dim post As Object
With CreateObject("InternetExplorer.Application")
.Visible = True
.navigate "https://zonasegura1.bn.com.pe/TipoCambio/"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set post = .document.querySelector(".movimiento span.l2.valor")
[A1] = post.innerText
.Quit
End With
End Sub
It is faster to use XMLHTTP request as follows:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://zonasegura1.bn.com.pe/TipoCambio/", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Set html = New HTMLDocument
With html
.body.innerHTML = sResponse
Debug.Print .querySelector(".movimiento .l2.valor").innerText
End With
End Sub
