Web Scraping - COVID19 incidents - excel

I have some questions regarding an Excel VBA program that I want to build.
Basically it's pretty easy. I want to access the following website https://coronavirus.jhu.edu/map.html
and extract the Confirmed Cases by Country/Region/Sovereignty (it's the table on the very left of the dashboard) and paste the values into Excel.
I know all the basic stuff about how to set up an InternetExplorer instance and scrape the page by tags, classes, ids etc.
But I think in this scenario I cannot use the basic things. I guess it's pretty tricky actually.
The information I am looking for is within some <strong> tags. But I cannot get their text content when I use the getElementsByTagName("strong") approach.
Could someone help me in this case?
I am grateful for any hints, advice and solutions.
Below you'll find the start of my code.
Best
Simon
Sub test()
Dim ie As InternetExplorer
Dim html As HTMLDocument
Dim i As Integer
Dim obj_coll As IHTMLElementCollection
Dim obj As HTMLObjectElement
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "https://coronavirus.jhu.edu/map.html"
Do Until ie.readyState = READYSTATE_COMPLETE
DoEvents
Loop
Debug.Print "Successfully connected with host"
Set html = ie.document
Set obj_coll = html.getElementsByTagName("strong")
For Each obj In obj_coll
Debug.Print obj.innerText
Next obj
ie.Quit
End Sub

You can navigate directly to the iframe URL. You then need a timed wait to ensure the data has loaded within that iframe. I would then collect nodeLists via faster CSS selectors. As the two nodeLists (one for figures and the other for locations) are the same length, you only need a single loop that indexes into both lists to get rows of data.
Option Explicit
Public Sub GetCovidFigures()
Dim ie As SHDocVw.InternetExplorer
Set ie = New SHDocVw.InternetExplorer
Dim t As Date
Const MAX_WAIT_SEC As Long = 30
With ie
.Visible = True
.Navigate2 "https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6"
Do
DoEvents
Loop While .Busy Or .readyState <> READYSTATE_COMPLETE
t = Timer
Do
If Timer - t > MAX_WAIT_SEC Then Exit Sub
Loop While .document.querySelectorAll(".feature-list strong").Length = 0
Dim figures As Object, location As Object, results(), i As Long
Set figures = .document.querySelectorAll("h5 strong")
Set location = .document.querySelectorAll("h5 span:last-child")
ReDim results(1 To figures.Length, 1 To 2)
For i = 0 To figures.Length - 1
results(i + 1, 1) = figures.item(i).innerText
results(i + 1, 2) = location.item(i).innerText
Next
.Quit
End With
ActiveSheet.Cells(1, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
Consider how frequently you want this data. A large number of APIs are popping up to supply it, and you could issue faster XHR requests to one of those instead. Additionally, you could simply take the source data in csv form from github here. Note the stated update schedule: "Files after Feb 1 (UTC): once a day around 23:59 (UTC)".
There is also a REST API visible in the dev tools network tab that frequently supplies new data in JSON format, which is used to update the page. That can be accessed via Python + requests or R + httr, for example. I suspect this endpoint is not intended to be hit, so look for public APIs.
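As a rough illustration of the XHR route, here is a minimal sketch that downloads a text file and writes it to a sheet without opening a browser. The names FetchText and WriteCsvToSheet are made up for this example, the URL is whatever csv address you decide to use, and the comma split is deliberately naive.

```vba
Option Explicit

' Hypothetical helper: download a text resource (e.g. a csv file) via XHR.
' The caller supplies the URL; nothing here is tied to one data source.
Public Function FetchText(ByVal url As String) As String
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", url, False      ' False = synchronous request
    http.send
    ' Returns an empty string for any status other than 200
    If http.Status = 200 Then FetchText = http.responseText
End Function

' Hypothetical helper: write csv text to the sheet, starting at the target cell.
' Assumption: no quoted fields that contain commas (a real csv parser is
' needed for values such as "Korea, South").
Public Sub WriteCsvToSheet(ByVal csvText As String, ByVal target As Range)
    Dim csvLines() As String, fields() As String
    Dim r As Long, c As Long
    csvLines = Split(Replace(csvText, vbCrLf, vbLf), vbLf)
    For r = LBound(csvLines) To UBound(csvLines)
        If Len(csvLines(r)) > 0 Then
            fields = Split(csvLines(r), ",")
            For c = LBound(fields) To UBound(fields)
                target.Offset(r, c).Value = fields(c)
            Next c
        End If
    Next r
End Sub
```

A call would look like WriteCsvToSheet FetchText(csvUrl), ActiveSheet.Range("A1"), where csvUrl is a String holding the address of the file you chose.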

Related

Excel VBA - Web Scraping - Get value in HTML Table cell

I am trying to create a macro that scrapes a cargo tracking website.
But I have to create 4 such macros as each airline has a different website.
I am new to VBA and web scraping.
I have put together code that works for one website. But when I tried to replicate it for another one, I got stuck in the loop. I think it may be how I am referring to the element, but like I said, I am new to VBA and have no clue about HTML.
I am trying to get the "notified" value in the highlighted line from the image.
IMAGE:"notified" text to be extracted
Below is the code I have written so far that gets stuck in the loop.
Any help with this would be appreciated.
Sub FlightStat_AF()
Dim url As String
Dim ie As Object
Dim nodeTable As Object
'You can handle the parameters id and pfx in a loop to scrape dynamic numbers
url = "https://www.afklcargo.com/mycargo/shipment/detail/057-92366691"
'Initialize Internet Explorer, set visibility,
'call URL and wait until page is fully loaded
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = False
ie.navigate url
Do Until ie.readyState = 4: DoEvents: Loop
'Wait to load dynamic content after IE reports it's ready
'We can do that in a loop to match the point the information is available
Do
On Error Resume Next
Set nodeTable = ie.document.getElementByClassName("block-whisper")
On Error GoTo 0
Loop Until Not nodeTable Is Nothing
'Get the status from the table
MsgBox Trim(nodeTable.getElementsByClassName("fs-12 body-font-bold").innerText)
'Clean up
ie.Quit
Set ie = Nothing
Set nodeTable = Nothing
End Sub
Some basics:
For simple accesses, like the present ones, you can use the get methods of the DOM (Document Object Model). But there is an important difference between getElementByID() and getElementsByClassName() / getElementsByTagName().
getElementByID() searches for the unique ID of an html tag. This is written as the ID attribute of the html tag. If the page keeps to the html standard, there is only one element with this unique ID. That's the reason why the method begins with getElement.
If the ID is not found, the method returns Nothing, and VBA throws a runtime error as soon as you try to use the result. That is why, in my other answer, the call is wrapped in a loop that switches the error handling off and on again. But in the page from this question there is no ID for the html area in question.
Instead, the required element can be accessed directly. You tried the access with getElementsByClassName(). That's right. But here comes the difference to getElementByID().
getElementsByClassName() and getElementsByTagName() begin with getElements. That's plural, because there can be as many elements with the same class or tag name as you want. Both methods create an html node collection. All html elements with the requested class or tag name are listed in that collection.
All elements have an index, just like in an array. The indexes start at 0. To access a particular element, the desired index must be specified. The two class names fs-12 body-font-bold (class names are separated by spaces; you can also build a node collection by using only one class name) deliver 2 html elements to the node collection. You want the second one, so you must use the index 1.
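The indexing rule can be wrapped in a small guard so that an index outside the collection yields an empty string instead of a runtime error. This is only a sketch: TextByClassIndex is a hypothetical helper name, and doc is assumed to be a loaded document object such as ie.document.

```vba
Option Explicit

' Hypothetical helper: return the trimmed text of the element at a given
' zero-based index within the collection for the given class name(s).
Public Function TextByClassIndex(ByVal doc As Object, _
                                 ByVal classNames As String, _
                                 ByVal index As Long) As String
    Dim nodes As Object
    Set nodes = doc.getElementsByClassName(classNames)
    ' Valid indexes run from 0 to Length - 1
    If index >= 0 And index < nodes.Length Then
        TextByClassIndex = Trim$(nodes(index).innerText)
    End If
End Function
```

For the page discussed here, the call would be TextByClassIndex(ie.document, "fs-12 body-font-bold", 1).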
This is the VBA code for the asked page by using the IE:
Sub FlightStat_AF()
Dim url As String
Dim ie As Object
'You can handle the parameters id and pfx in a loop to scrape dynamic numbers
url = "https://www.afklcargo.com/mycargo/shipment/detail/057-92366691"
'Initialize Internet Explorer, set visibility,
'call URL and wait until page is fully loaded
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = False
ie.navigate url
Do Until ie.readyState = 4: DoEvents: Loop
'Wait to load dynamic content after IE reports it's ready
'We do that with a fixed manual pause of a few seconds
'because the whole page will be reloaded
'The three TimeSerial arguments are hours, minutes, seconds
Application.Wait (Now + TimeSerial(0, 0, 3))
'Get the status from the table
MsgBox Trim(ie.document.getElementsByClassName("fs-12 body-font-bold")(1).innerText)
'Clean up
ie.Quit
Set ie = Nothing
End Sub
Edit: Sub as function
This sub to test the function:
Sub testFunction()
Dim flightStatAfResult As String
flightStatAfResult = FlightStat_AF("057-92366691")
MsgBox flightStatAfResult
End Sub
This is the sub as function:
Function FlightStat_AF(cargoNo As String) As String
Dim url As String
Dim ie As Object
Dim result As String
'You can handle the parameters id and pfx in a loop to scrape dynamic numbers
url = "https://www.afklcargo.com/mycargo/shipment/detail/" & cargoNo
'Initialize Internet Explorer, set visibility,
'call URL and wait until page is fully loaded
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = False
ie.navigate url
Do Until ie.readyState = 4: DoEvents: Loop
'Wait to load dynamic content after IE reports it's ready
'We do that with a fixed manual pause of a few seconds
'because the whole page will be reloaded
'The three TimeSerial arguments are hours, minutes, seconds
Application.Wait (Now + TimeSerial(0, 0, 3))
'Get the status from the table
result = Trim(ie.document.getElementsByClassName("fs-12 body-font-bold")(1).innerText)
'Clean up
ie.Quit
Set ie = Nothing
'Return value of the function
FlightStat_AF = result
End Function

VBA to click on a button on an IE form to submit it

I'm quite new to VBA so please bear with me. I've been trying to create an automation that fills in a username and password and logs in to a site (to start with), but I've been having trouble clicking the submit button. I scoured the internet and learnt a whole bunch of things, but I didn't find anything that seems to work. The page loads and the details are filled in, but nothing else happens when I run the code below.
Would greatly appreciate some help with this. Thanks in advance as always!
Sub worldcheck()
Dim lastrow As Long
Dim IE As Object
Dim cel As Range
Dim post As Object
Dim ws As Worksheet
Dim element As Object
Set ws = Sheets("sheet1")
Set IE = CreateObject("internetexplorer.application")
lastrow = ws.Range("B" & ws.Rows.Count).End(xlUp).Row
IE.Visible = True
IE.Navigate "https://www.world-check.com/frontend/login/"
Do While IE.busy
DoEvents
Loop
Application.Wait (Now + TimeValue("0:00:2"))
IE.document.getElementbyID("username").Value = ws.Range("D2")
IE.document.getElementbyID("password").Value = ws.Range("D3")
IE.document.getElementbyClass("button").click
Do While IE.busy
DoEvents
Loop
End Sub
Nothing else happens? You should be getting an error message at the very least, as you are trying to use a non-existent method (VBA run-time error 438: Object doesn't support this property or method). The method is getElementsByClassName: note the s, indicating that it returns a collection, and that the ending is ClassName. You would then need to index into that collection before attempting to access the Click method.
As there is only a single element with the class name button, you can use a faster CSS class selector (this is also faster than using a type selector of form; likewise, you can use the faster CSS equivalent of getElementById for the other two DOM elements). document.querySelector stops at the first match, so it is also more efficient.
Finally, rather than hard coded waits use proper page load waits as shown below:
Option Explicit
Public Sub WorldCheck()
Dim ie As Object
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
.Navigate2 "https://www.world-check.com/frontend/login/"
While .busy Or .readystate <> 4: DoEvents: Wend
With .document
.querySelector("#username").Value = "ABCDEF" ' equivalent to .getElementbyID("username").Value = "ABCDEF"
.querySelector("#password").Value = "GHIJKL" '.getElementbyID("password").Value = "GHIJKL"
.querySelector(".button").Click
End With
While .busy Or .readystate <> 4: DoEvents: Wend
Stop '<== delete me later
.Quit
End With
End Sub

How to automate check for multiple website time taken to load from excel?

I have around 100 websites in an Excel sheet, and I need to check the time each website takes to load.
Currently I am doing it manually, one by one, in Internet Explorer by using the developer tools (F12) > Network > Taken (column).
Is there any way to do this check automatically?
I'm not sure of your level of VBA, and this is far from perfect, but as a starting point: this is code I normally use to scrape websites, modified slightly to time how long each website takes to open. It assumes you have all your websites in column A, and it prints the time taken to open each website in column B. Bear in mind that this measures the time the macro takes to run per website, so it is not entirely accurate as to how long the actual site takes to load.
Enum READYSTATE
READYSTATE_UNINITIALIZED = 0
READYSTATE_LOADING = 1
READYSTATE_LOADED = 2
READYSTATE_INTERACTIVE = 3
READYSTATE_COMPLETE = 4
End Enum
Sub ImportWebsiteDate()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer, WDApp As Object, staTime As Double, elapsedTime As Double
'to refer to the HTML document returned
Dim html As HTMLDocument, websiteNo As Integer, curSite As String
websiteNo = Range("A500").End(xlUp).Row
'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
For i = 1 To websiteNo
curSite = Cells(i, 1).Value
ie.Visible = False
staTime = Timer
ie.navigate curSite
'Wait until IE is done loading page
Do While ie.READYSTATE <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to " & curSite
DoEvents
Loop
elapsedTime = Round(Timer - staTime, 2)
Cells(i, 2).Value = elapsedTime
Next i
'Close the hidden browser so no IE process is left running
ie.Quit
Set ie = Nothing
Application.StatusBar = False
Application.ScreenUpdating = True
End Sub
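This won't work if you simply copy and paste it. The code uses early binding (Dim ie As InternetExplorer, Dim html As HTMLDocument), so the project first needs references to Microsoft Internet Controls and Microsoft HTML Object Library (Tools > References in the VBA editor). Read this site for info on referencing the required applications.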
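If setting references is not an option, a late-bound variant avoids them. The following is a sketch under a few assumptions: TimePageLoad is a hypothetical name, 60 seconds is an arbitrary cap, and the Timer-based measurement has the same limits as above (it also resets at midnight).

```vba
Option Explicit

' Hypothetical helper: return the seconds IE needs until it reports the page
' as loaded. Late binding means no library references are required.
Public Function TimePageLoad(ByVal url As String, _
                             Optional ByVal maxWaitSec As Long = 60) As Double
    Dim ie As Object
    Dim startTime As Double

    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    startTime = Timer
    ie.navigate url

    ' 4 is the numeric value of READYSTATE_COMPLETE
    Do While ie.Busy Or ie.readyState <> 4
        DoEvents
        ' Give up on pages that never finish loading
        If Timer - startTime > maxWaitSec Then Exit Do
    Loop

    TimePageLoad = Round(Timer - startTime, 2)
    ie.Quit
End Function
```

In a loop over column A this could be used as Cells(i, 2).Value = TimePageLoad(Cells(i, 1).Value). Opening a new IE instance per site is slower than reusing one, but it keeps each measurement independent.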

Error "Object variable or with block variable not set" when using getElementsByClassName

I want to scrape some fields from Amazon.
At the moment I am using a link, and my VBA script returns the name and price.
For example:
I put the link into column A and get the other fields in the respective columns, e.g.: http://www.amazon.com/GMC-Denali-Black-22-5-Inch-Medium/dp/B00FNVBS5C/ref=sr_1_1?s=outdoor-recreation&ie=UTF8&qid=1436768082&sr=1-1&keywords=bicycle
However, I would also like to have the product description.
Here is my current code:
Sub ScrapeAmz()
Dim Ie As New InternetExplorer
Dim WebURL
Dim Docx As HTMLDocument
Dim productDesc
Dim productTitle
Dim price
Dim RcdNum
Ie.Visible = False
For RcdNum = 2 To ThisWorkbook.Worksheets(1).Range("A65536").End(xlUp).Row
WebURL = ThisWorkbook.Worksheets(1).Range("A" & RcdNum)
Ie.Navigate2 WebURL
Do Until Ie.ReadyState = READYSTATE_COMPLETE
DoEvents
Loop
Set Docx = Ie.Document
productTitle = Docx.getElementById("productTitle").innerText
'productDesc = Docx.getElementsByClassName("productDescriptionWrapper")(0).innerText
price = Docx.getElementById("priceblock_ourprice").innerText
ThisWorkbook.Worksheets(1).Range("B" & RcdNum) = productTitle
'ThisWorkbook.Worksheets(1).Range("C" & RcdNum) = productDesc
ThisWorkbook.Worksheets(1).Range("D" & RcdNum) = price
Next
End Sub
I am trying to get the product description by using productDesc = Docx.getElementsByClassName("productDescriptionWrapper")(0).innerText.
However, I get an error:
Object variable or With block variable not set.
Any suggestions as to why my statement does not work?
I appreciate your replies!
I'm pretty sure your problem is being caused by attempting to access the document before it's completely loaded. You're just checking ie.ReadyState.
This is my understanding of the timeline for loading a page with an IE control.
Browser connects to page: ie.ReadyState = READYSTATE_COMPLETE. At this point, you can access ie.document without causing an error, but the document has only started loading.
Document fully loaded: ie.document.readyState = "complete"
(note that frames may still be loading and AJAX processing may still be occurring.)
So you really need to check for two events.
Do
If ie.ReadyState = READYSTATE_COMPLETE Then
If ie.document.readyState = "complete" Then Exit Do
End If
Application.Wait DateAdd("s", 1, Now)
Loop
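The loop above runs forever if the page never finishes loading. A bounded version of the same two-stage check might look like the sketch below. WaitForDocument is a hypothetical name, and the 30 second default is an arbitrary assumption.

```vba
Option Explicit

' Hypothetical helper: wait until both the browser and the document report
' completion. Returns False if that does not happen within maxWaitSec.
Public Function WaitForDocument(ByVal ie As Object, _
                                Optional ByVal maxWaitSec As Long = 30) As Boolean
    Dim deadline As Date
    deadline = DateAdd("s", maxWaitSec, Now)

    Do While Now < deadline
        ' Stage 1: the browser control itself is idle and complete (4)
        If Not ie.Busy And ie.readyState = 4 Then
            ' Stage 2: the document has finished loading
            If ie.document.readyState = "complete" Then
                WaitForDocument = True
                Exit Function
            End If
        End If
        DoEvents
    Loop
End Function
```

The caller can then write If Not WaitForDocument(ie) Then Exit Sub before touching ie.document. As noted above, frames and AJAX content may still be loading even after this returns True.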
edit: after actually looking at the page you're trying to scrape, it looks like the reason it's failing is that the content you're trying to get at is in an iframe. You need to go through the iframe before you can get to the content. Note that getElementsByClassName returns a collection, so it needs an index:
productDesc = ie.document.window.frames("product-description-iframe").contentWindow.document.getElementsByClassName("productDescriptionWrapper")(0).innerText

excel vba copying fedex tracking information

I want to copy into Excel the 3 tracking information tables that the website generates when I track a parcel. I want to do it through Excel VBA. I can write a loop and generate this webpage for various tracking numbers. But I am having a hard time copying the tables: the top table, the travel history and the shipment track table. Any solution? In my VBA code below, the last 3 lines give an error :( - run-time error '438': Object doesn't support this property or method.
Sub final()
Application.ScreenUpdating = False
Set ie = CreateObject("InternetExplorer.Application")
my_url = "https://www.fedex.com/fedextrack/index.html?tracknumbers=713418602663&cntry_code=us"
With ie
.Visible = True
.navigate my_url
Do Until Not ie.Busy And ie.readyState = 4
DoEvents
Loop
End With
ie.document.getElementById("detailsBody").Value
ie.document.getElementById("trackLayout").Value
ie.document.getElementById("detail").Value
End Sub
.Value is not available in that context. Also, you will want to assign the return value of the method call to a variable, and you should declare your variables :)
I made some modifications and included one possible way of getting the data from one of the tables. You may need to reformat the output using TextToColumns or similar, since it prints each row in a single cell.
I also noticed that when I execute this, the tables have sometimes not finished loading, and the result is an error unless you put in a suitable wait or use some other method to determine when the data has fully loaded on the webpage. I use a simple Application.Wait.
Option Explicit
Sub final()
Dim ie As Object
Dim my_url As String
Dim travelHistory As Object
Dim history As Variant
Dim h As Variant
Dim i As Long
Application.ScreenUpdating = False
Set ie = CreateObject("InternetExplorer.Application")
my_url = "https://www.fedex.com/fedextrack/index.html?tracknumbers=713418602663&cntry_code=us"
With ie
.Visible = True
.navigate my_url
'## I modified this logic a little bit:
'   keep waiting while IE is busy OR the page is not yet complete
Do While .Busy Or .readyState <> 4
DoEvents
Loop
End With
'## Here is a simple method wait for IE to finish, you may need a more robust solution
' For assistance with that, please ask a NEW question.
Application.Wait Now() + TimeValue("0:00:10")
'## Get one of the tables
Set travelHistory = ie.Document.GetElementByID("travel-history")
'## Split the table to an array
history = Split(travelHistory.innerText, vbLf)
'## Iterate the array and write each row to the worksheet
For Each h In history
Range("A1").Offset(i).Value = h
i = i + 1
Next
ie.Quit
Set ie = Nothing
End Sub
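Instead of a fixed 10 second pause, one option is to poll for the element and stop as soon as it exists. This is a sketch only: WaitForElementById is a hypothetical helper, and the 30 second cap is an assumption.

```vba
Option Explicit

' Hypothetical helper: poll until an element with the given id exists.
' Returns Nothing if it has not appeared within maxWaitSec.
Public Function WaitForElementById(ByVal ie As Object, _
                                   ByVal elementId As String, _
                                   Optional ByVal maxWaitSec As Long = 30) As Object
    Dim deadline As Date
    Dim node As Object

    deadline = DateAdd("s", maxWaitSec, Now)
    Do
        DoEvents
        ' The document can be unavailable during navigation,
        ' so ignore errors for this single lookup
        On Error Resume Next
        Set node = ie.document.getElementById(elementId)
        On Error GoTo 0
        If Not node Is Nothing Then Exit Do
    Loop While Now < deadline

    Set WaitForElementById = node
End Function
```

With this in place, the lookup becomes Set travelHistory = WaitForElementById(ie, "travel-history"), followed by a check such as If travelHistory Is Nothing Then Exit Sub. The element existing does not guarantee that all of its rows have been filled in, so a short extra wait may still be needed.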

Resources