Excel vba getElementsByClassName - excel

I am trying to scrape IPO date from crunchbase.
Unfortunately I get Runtime Error 1004 “Application-defined or Object-defined error”.
My goal is to save IPO date in A1 cell.
Sub GetIE()
Dim IE As Object
Dim URL As String
Dim myValue As IHTMLElement
URL = "https://www.crunchbase.com/organization/verastem"
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
IE.Navigate URL
Do While IE.Busy Or IE.ReadyState <> 4
DoEvents
Loop
Set myValue = IE.Document.getElementsByClassName("post_glass post_micro_glass")(0)
Range("A1").Value = myValue
Set IE = Nothing
End Sub

I can't find that class name in the HTML for that URL. You can use the CSS selector I show below instead. The content can be retrieved with XMLHTTP, which avoids opening a browser.
Option Explicit
Public Sub GetDate()
Dim html As HTMLDocument
Set html = New HTMLDocument '< VBE > Tools > References > Microsoft HTML Object Library
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.crunchbase.com/organization/verastem#section-overview", False
.send
html.body.innerHTML = .responseText
End With
ActiveSheet.Range("A1") = html.querySelectorAll(".field-type-date.ng-star-inserted").item(1).innerText
End Sub
If you don't want to use compound classes then you can also use
ActiveSheet.Range("A1") = html.querySelectorAll("#section-ipo-stock-price .field-type-date").item(1).innerText
The relevant HTML is shown below. Note that the element has multiple (compound) classes:
<span class="component--field-formatter field-type-date ng-star-inserted" title="Jan 27, 2012">Jan 27, 2012</span>
There are three classes: component--field-formatter, field-type-date and ng-star-inserted. I use two of these in combination in the first solution I give. Multiple classes are popular nowadays because of the versatility they give in page styling, e.g. they allow styles to be overridden easily. You can read about CSS specificity* to understand this better.
Using more classes may make the code a little less robust, as the ordering of classes may change and one or more classes may be removed. This was raised by @SIM in a comment on an answer to another web-scraping question. Thus, I offer one solution that uses two of the classes and another that uses only one.
Whilst you do get the same date for this page with simply:
ActiveSheet.Range("A1") = html.querySelector("#section-ipo-stock-price .field-type-date").innerText
I wouldn't want to assume that would always hold true as it grabs the date from the line where it says "Their stock opened".
* https://developer.mozilla.org/en-US/docs/Web/CSS/Specificity
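As a minimal sketch of how the compound-class and single-class selectors behave, the snippet below parses the span shown above from a string instead of the live site. It assumes a reference to Microsoft HTML Object Library; FirstMatchText, SelectorDemo and the wrapping div markup are hypothetical names added for illustration only.

```vba
Option Explicit

' Hypothetical helper: returns the innerText of the first node matching
' cssSelector, or an empty string when nothing matches.
Public Function FirstMatchText(ByVal markup As String, ByVal cssSelector As String) As String
    Dim html As HTMLDocument            ' requires Microsoft HTML Object Library
    Dim node As Object

    Set html = New HTMLDocument
    html.body.innerHTML = markup
    Set node = html.querySelector(cssSelector)
    ' querySelector yields Nothing when there is no match, so guard before reading
    If Not node Is Nothing Then FirstMatchText = node.innerText
End Function

Public Sub SelectorDemo()
    Dim sample As String
    ' Sample markup: the span from the answer, wrapped in an assumed parent div
    sample = "<div id=""section-ipo-stock-price"">" & _
             "<span class=""component--field-formatter field-type-date ng-star-inserted"" " & _
             "title=""Jan 27, 2012"">Jan 27, 2012</span></div>"
    ' Two classes combined (no space between them)
    Debug.Print FirstMatchText(sample, ".field-type-date.ng-star-inserted")
    ' Parent id plus a single class (space = descendant combinator)
    Debug.Print FirstMatchText(sample, "#section-ipo-stock-price .field-type-date")
End Sub
```

Guarding against Nothing keeps the macro from failing with an object error if the site drops or renames one of the classes.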

Related

Excel VBA IE Object and using dropdown list

I am experimenting with web automation and struggling a bit trying to utilize a drop down list.
My code works up to the point of searching for a company name and hitting "go". On the new page I can't seem to find the right code that selects the group of elements that represents the drop down list. I then want to select "100" entries, but I can't even grab the nodes that represent this list.
I have been browsing multiple pages on Stack Overflow that talk about CSS selectors and have looked at tutorials, but that doesn't seem to help either. I either end up grabbing nothing, or whatever I grab can't use the getElementsByTagName method. Ultimately I am trying to drill down into the td and select nodes. I am not sure what to do with those yet, but I can't even grab them. Thoughts?
(note stopline is just a line that I use a breakpoint on to stop my code)
CSS helper website: https://www.w3schools.com/cssref/trysel.asp
Code:
Option Explicit
Sub test()
On Error GoTo ErrHandle
Dim ie As New InternetExplorer
Dim doc As New HTMLDocument
Dim ws As Worksheet
Dim stopLine As Integer
Dim oSearch As Object, oSearchButton As Object
Dim oForm As Object
Dim oSelect As Object
Dim list As Object
Set ws = ThisWorkbook.Worksheets("Sheet1")
ie.Visible = True
ie.navigate "https://www.sec.gov/edgar/searchedgar/companysearch.html"
Do
DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE
Set doc = ie.Document
Set oSearch = doc.getElementById("companysearchform")
Set oSearchButton = oSearch.getElementsByTagName("input")(1)
Set oSearch = oSearch.getElementsByTagName("input")(0)
oSearch.Value = "Summit Midstream Partners, LP"
oSearchButton.Click
Do
DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE
Set doc = ie.Document
Set list = doc.querySelectorAll("td select")
stopLine = 1
Exit Sub
ErrHandle:
MsgBox Err.Number & " - " & Err.Description, vbCritical
Exit Sub
End Sub
td select will return a single node, so you only need querySelector. The node has an id, so you might as well use the quicker querySelector("#count") to target the parent select. To change the option you can then set selectedIndex on the parent select, or target the child option by its value attribute: querySelector("[value='100']").Selected = True. You may then need to attach and trigger a change/onchange HTML event on the parent select to register the change.
However, I would simply extract the company CIK from the current page, then concatenate the count=100 param into the URL and .Navigate2 to it using the following format:
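The option-selection route described above could look roughly like the following. This is a sketch only: it assumes ie is an InternetExplorer instance already showing the results page, that the page runs in a document mode where createEvent is available (IE9 or later), and that the page reacts to a change event on the select. SelectHundredEntries is a hypothetical name.

```vba
Option Explicit

' ie is assumed to be an InternetExplorer instance showing the results page.
Public Sub SelectHundredEntries(ByVal ie As Object)
    Dim countSelect As Object
    Dim hundredOption As Object
    Dim changeEvent As Object

    ' Target the parent select by its id
    Set countSelect = ie.document.querySelector("#count")
    If countSelect Is Nothing Then Exit Sub

    ' Target the child option by its value attribute and select it
    Set hundredOption = countSelect.querySelector("[value='100']")
    If hundredOption Is Nothing Then Exit Sub
    hundredOption.Selected = True

    ' Assumption: the page listens for a change event on the select
    Set changeEvent = ie.document.createEvent("HTMLEvents")
    changeEvent.initEvent "change", True, False
    countSelect.dispatchEvent changeEvent
End Sub
```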
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001549922&type=&dateb=&owner=include&count=100&search_text=
After the initial company search click and the wait for the page to load, you can extract the CIK with:
Dim cik As String
cik = ie.document.querySelector("[name=CIK]").value
ie.Navigate2 "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" & cik & "&type=&dateb=&owner=include&count=100&search_text="
Given several params are left blank you can likely shorten to:
"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" & cik & "&owner=include&count=100"
If you are unable to get the initial parent select you probably need a timed loop waiting for that element to be present after clicking the search button. An example is shown here in a StackOverflow answer.
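A timed loop of that kind might be sketched as follows. WaitForElement and its parameters are hypothetical names, and the 10-second default timeout is an arbitrary assumption. The function polls until the selector matches or the deadline passes, and returns Nothing on timeout.

```vba
Option Explicit

' Hypothetical helper: polls ie.document for cssSelector until it matches
' or timeoutSeconds elapse. Returns Nothing if the element never appears.
Public Function WaitForElement(ByVal ie As Object, ByVal cssSelector As String, _
                               Optional ByVal timeoutSeconds As Long = 10) As Object
    Dim deadline As Date
    Dim found As Object

    deadline = DateAdd("s", timeoutSeconds, Now)
    Do
        DoEvents
        ' The document can be briefly unavailable while the page navigates,
        ' so errors are ignored for this single call only
        On Error Resume Next
        Set found = ie.document.querySelector(cssSelector)
        On Error GoTo 0
        If Not found Is Nothing Then Exit Do
    Loop While Now < deadline

    Set WaitForElement = found
End Function
```

It could be called after oSearchButton.Click, for example Set list = WaitForElement(ie, "#count"), followed by a check for Nothing before using the result.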

Excel VBA - Web Scraping - Get value in HTML Table cell

I am trying to create a macro that scrapes a cargo tracking website.
But I have to create 4 such macros as each airline has a different website.
I am new to VBA and web scraping.
I have put together code that works for one website. But when I tried to replicate it for another one, I got stuck in the loop. I think it may be how I am referring to the element, but like I said, I am new to VBA and have no clue about HTML.
I am trying to get the "notified" value in the highlighted line from the image.
IMAGE:"notified" text to be extracted
Below is the code I have written so far that gets stuck in the loop.
Any help with this would be appreciated.
Sub FlightStat_AF()
Dim url As String
Dim ie As Object
Dim nodeTable As Object
'You can handle the parameters id and pfx in a loop to scrape dynamic numbers
url = "https://www.afklcargo.com/mycargo/shipment/detail/057-92366691"
'Initialize Internet Explorer, set visibility,
'call URL and wait until page is fully loaded
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = False
ie.navigate url
Do Until ie.readyState = 4: DoEvents: Loop
'Wait to load dynamic content after IE reports it's ready
'We can do that in a loop to match the point the information is available
Do
On Error Resume Next
Set nodeTable = ie.document.getElementByClassName("block-whisper")
On Error GoTo 0
Loop Until Not nodeTable Is Nothing
'Get the status from the table
MsgBox Trim(nodeTable.getElementsByClassName("fs-12 body-font-bold").innerText)
'Clean up
ie.Quit
Set ie = Nothing
Set nodeTable = Nothing
End Sub
Some basics:
For simple accesses, like the present ones, you can use the get methods of the DOM (Document Object Model). But there is an important difference between getElementByID() and getElementsByClassName() / getElementsByTagName().
getElementByID() searches for the unique ID of an HTML tag. This is written as the ID attribute of the tag. If the page follows the HTML standard, there is only one element with this unique ID. That's the reason why the method begins with getElement.
If the ID is not found, the method returns Nothing, and VBA throws a runtime error as soon as you try to use the result. That is why, in the loop from my other answer, the call is wrapped by switching error handling off and on again. But in the page from this question there is no ID for the HTML area in question.
Instead, the required element can be accessed directly. You tried the access with getElementsByClassName(). That's right. But here comes the difference from getElementByID().
getElementsByClassName() and getElementsByTagName() begin with getElements. That's plural, because there can be as many elements with the same class or tag name as you want. Both methods create an HTML node collection. All HTML elements with the requested class or tag name are listed in that collection.
All elements have an index, just like an array. The indexes start at 0. To access a particular element, the desired index must be specified. The two class names fs-12 body-font-bold (class names are separated by spaces; you can also build a node collection using only one class name) deliver two HTML elements to the node collection. You want the second one, so you must use the index 1.
This is the VBA code for the page in question, using IE:
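To make the indexing point concrete, here is a small sketch that reads one item from a class-name collection and guards against an index that does not exist. ClassItemText is a hypothetical helper, and doc is assumed to be the ie.document of an already loaded page.

```vba
Option Explicit

' Hypothetical helper: returns the trimmed innerText of the element at the
' given zero-based index in the collection for classNames, or "" if the
' collection has no element at that index.
Public Function ClassItemText(ByVal doc As Object, ByVal classNames As String, _
                              ByVal index As Long) As String
    Dim nodes As Object

    ' Several class names are passed as one space-separated string
    Set nodes = doc.getElementsByClassName(classNames)

    ' The collection is zero-based: valid indexes are 0 to Length - 1
    If index >= 0 And index < nodes.Length Then
        ClassItemText = Trim$(nodes(index).innerText)
    End If
End Function
```

For the page in the question, the call would be ClassItemText(ie.document, "fs-12 body-font-bold", 1) to read the second matching element.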
Sub FlightStat_AF()
Dim url As String
Dim ie As Object
'You can handle the parameters id and pfx in a loop to scrape dynamic numbers
url = "https://www.afklcargo.com/mycargo/shipment/detail/057-92366691"
'Initialize Internet Explorer, set visibility,
'call URL and wait until page is fully loaded
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = False
ie.navigate url
Do Until ie.readyState = 4: DoEvents: Loop
'Wait to load dynamic content after IE reports it's ready
'We do that with a fixed manual pause of a few seconds
'because the whole page will be reloaded
'The last three values are hours, minutes, seconds
Application.Wait (Now + TimeSerial(0, 0, 3))
'Get the status from the table
MsgBox Trim(ie.document.getElementsByClassName("fs-12 body-font-bold")(1).innerText)
'Clean up
ie.Quit
Set ie = Nothing
End Sub
Edit: the Sub as a function
This sub tests the function:
Sub testFunction()
Dim flightStatAfResult As String
flightStatAfResult = FlightStat_AF("057-92366691")
MsgBox flightStatAfResult
End Sub
This is the sub rewritten as a function:
Function FlightStat_AF(cargoNo As String) As String
Dim url As String
Dim ie As Object
Dim result As String
'You can handle the parameters id and pfx in a loop to scrape dynamic numbers
url = "https://www.afklcargo.com/mycargo/shipment/detail/" & cargoNo
'Initialize Internet Explorer, set visibility,
'call URL and wait until page is fully loaded
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = False
ie.navigate url
Do Until ie.readyState = 4: DoEvents: Loop
'Wait to load dynamic content after IE reports it's ready
'We do that with a fixed manual pause of a few seconds
'because the whole page will be reloaded
'The last three values are hours, minutes, seconds
Application.Wait (Now + TimeSerial(0, 0, 3))
'Get the status from the table
result = Trim(ie.document.getElementsByClassName("fs-12 body-font-bold")(1).innerText)
'Clean up
ie.Quit
Set ie = Nothing
'Return value of the function
FlightStat_AF = result
End Function

Set HTML_Content = CreateObject("htmlfile") doesn't work

I have this code, which copies the content of a website to Excel:
Sub HTML_Table_To_Excel()
Dim htm As Object
Dim Tr As Object
Dim Td As Object
Dim Tab1 As Object
Web_URL = "https://www.fxstreet.com/economic-calendar"
'Create HTMLFile Object
Set HTML_Content = CreateObject("htmlfile")
'Get the WebPage Content to HTMLFile Object
With CreateObject("msxml2.xmlhttp")
.Open "GET", Web_URL, False
.send
HTML_Content.body.innerHTML = .responseText 'this is the highlighted part for the error
End With
Column_Num_To_Start = 1
iRow = 1
iCol = 1
iTable = 1
'Loop Through Each Table and Download it to Excel in Proper Format
For Each Tab1 In HTML_Content.getElementsByTagName("table")
With HTML_Content.getElementsByTagName("table")(iTable)
For Each Tr In .Rows
For Each Td In Tr.Cells
Worksheets("Sheet1").Cells(iRow, iCol).Select
Worksheets("Sheet1").Cells(iRow, iCol) = Td.innerText
iCol = iCol + 1
Next Td
iCol = Column_Num_To_Start
iRow = iRow + 1
Next Tr
End With
Next Tab1
MsgBox "Process Completed"
End Sub
This code was working well, but now it is not. A message appears: "Run-time error '429': ActiveX component can't create object". If I then select Debug, this part of the code is highlighted:
Set HTML_Content = CreateObject("htmlfile")
What's the problem?
So this isn't going to be a nice "here you go" answer. At least not at present.
An XMLHTTP request will not work here. The page is dynamically loaded, and the content simply isn't there via the method you are using, which executes before this info is available.
You should always use Option Explicit at the top of your module. You have numerous undeclared variables (e.g. Web_URL should be declared As String) which will be created on the fly as Variants. And you are not catching what looks like either a typo or inconsistent variable naming: Dim htm As Object: Set HTML_Content = CreateObject("htmlfile").
I suspect that you wanted Set htm = CreateObject("htmlfile") . This would be an object type consistent with your existing naming and declaration. HTML_Content would be a string, and would not require the Set keyword which assigns an object reference. Here, I think you wanted HTML_Content = .responseText, but in fact it is safer to ensure any returned string is unencoded with HTML_Content = StrConv(.responseBody, vbUnicode)
There is only one table of interest, as far as I can see, and it has an id. If you were to continue with your method then after this line: For Each Tab1 In HTML_Content.getElementsByTagName("table"), each Tab1 would be an HTMLTable object so you wouldn't need the next line with iTable variable. You are using a For Each Loop so are already iterating the parent collection. Simply, grab the table by its id: With .document.getElementById("fxst-calendartable"). id is the fastest retrieval method available so should be preferred over all other methods when available.
The table is either poorly designed or deliberately designed to make scraping difficult. You can neither copy the object outerHTML to clipboard and paste the table to Excel, nor simply loop table rows and table cells to get all the content displayed. And the content that is displayed is not in a nice legible fashion. You may have some luck with web queries but in my experience Javascript heavy pages like this don't mix well with data from web queries.
So, with those points in mind:
You will likely need to use a browser to navigate and ensure page content has loaded, e.g. While ie.Busy Or ie.readyState < 4: DoEvents: Wend, possibly with some extra explicit wait time.
It is also likely you will need to inspect the HTML of the table closely to write something fit for purpose that grabs the elements you are interested in and writes them out in a coherent fashion to the sheet. For example, it is clear you will need to consider the role and placement of div elements within the table for correct alignment of data in your output.
You could simply drop the above approach and screenshot the timetable or print it out. Consider whether the effort of coding this is worth it over simply printing the page once a week (unless the page is frequently updated).
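The browser-based approach from the points above could start from something like this sketch. The 5-second pause is an arbitrary assumption, CalendarTableSketch is a hypothetical name, and the table id is the one mentioned earlier in this answer. It only reports whether the table was found, since writing out the cells depends on the table's actual structure.

```vba
Option Explicit

Public Sub CalendarTableSketch()
    Dim ie As Object
    Dim calendarTable As Object
    Dim webUrl As String

    webUrl = "https://www.fxstreet.com/economic-calendar"

    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    ie.navigate webUrl
    While ie.Busy Or ie.readyState < 4: DoEvents: Wend

    ' Extra explicit wait for dynamic content; the 5 seconds is an assumption
    Application.Wait Now + TimeSerial(0, 0, 5)

    ' Grab the single table of interest by its id
    Set calendarTable = ie.document.getElementById("fxst-calendartable")
    If calendarTable Is Nothing Then
        Debug.Print "Table not found; the content may still be loading."
    Else
        Debug.Print "Rows found: " & calendarTable.Rows.Length
    End If

    ie.Quit
    Set ie = Nothing
End Sub
```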

Export HTML to text file with different results

I have two procedures that are supposed to export a page's HTML to a text file.
Sub Demo1()
Dim http As New XMLHTTP60
Dim html As New HTMLDocument
With http
.Open "GET", "https://www.google.com.eg/", False
.send
html.body.innerHTML = .responseText
WriteTxtFile html.body.innerHTML
End With
End Sub
Sub WriteTxtFile(ByVal aString As String, Optional ByVal filePath As String = "C:\Users\Future\Desktop\Output.txt")
Dim fso As Object
Dim fileout As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Set fileout = fso.CreateTextFile(filePath, True, True)
fileout.write aString
fileout.Close
End Sub
Sub Demo2()
Dim ie As Object
Dim f As Integer
Set ie = CreateObject("InternetExplorer.Application")
With ie
.Visible = True
.navigate ("https://www.google.com.eg/")
Do: DoEvents: Loop Until .readyState = 4
f = FreeFile()
Open ThisWorkbook.Path & "\Sample.txt" For Output As #f
Print #f, .document.body.innerHTML
Close #f
.Quit
End With
End Sub
Demo1 and Demo2 are the two procedures; they produce "Output.txt" and "Sample.txt" respectively.
But I found that the two HTML documents differ.
Can you help me clarify which one is right, and why they are different?
Thanks in advance for your help.
XMLHTTP does not provide all the rendered content of a webpage, in particular anything rendered via JavaScript execution. Scripts are not executed.
Internet Explorer, on the other hand, will render the page fully, provided the browser version supports the JavaScript syntax used. For example, you will run into problems with ES6 (ECMAScript 2015) features, as these are not supported on legacy browsers. I believe they are supported on Edge for Windows 10. You can check compatibility tables to see what is and isn't supported.
If you familiarize yourself with dev tools for your browser you can explore how different parts of a webpage are rendered. You can learn to debug scripts and see what changes are made to the DOM and page styling. Often a page will issue XHR requests to update content on a page for example. If you want to have a play look here.
So, on this basis, I suspect that the first HTML document has less content and a different overall DOM structure from the second.
To test for differences caused by the method of writing to the text file, you need to compare apples with apples, i.e. use the same access method and syntax to retrieve the page content before writing it out.
Please provide the differences if you want a deeper explanation.
Exploring page updating:
Firefox Network Tab
Internet Explorer Network Inspector
Chrome Network Tab
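One rough way to see the difference between the two retrieval methods, independent of how the text file is written, is to fetch the same URL both ways and compare sizes. This is only a sketch; CompareStaticAndRendered is a hypothetical name, and character counts are a crude indicator, not a full diff.

```vba
Option Explicit

Public Sub CompareStaticAndRendered()
    Const PAGE_URL As String = "https://www.google.com.eg/"
    Dim staticHtml As String
    Dim renderedHtml As String
    Dim ie As Object

    ' Static response: scripts are not executed
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", PAGE_URL, False
        .send
        staticHtml = .responseText
    End With

    ' Rendered DOM: the browser has run whatever scripts it supports
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    ie.navigate PAGE_URL
    Do: DoEvents: Loop Until Not ie.Busy And ie.readyState = 4
    renderedHtml = ie.document.documentElement.outerHTML
    ie.Quit
    Set ie = Nothing

    Debug.Print "XMLHTTP characters: " & Len(staticHtml)
    Debug.Print "IE characters:      " & Len(renderedHtml)
End Sub
```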

VBA: It only shows ID at Inspect Element

I'm a newbie at VBA programming, and I use it to build a macro in Excel. Here is the webpage: http://www.eppraisal.com/home-values/property/14032-s-atlantic-ave-riverdale-il-60827-59143864/
I want to extract the value of $52,920 on the right upper corner. When I select the Inspect Element, this is the following:
<p id="eppraisalval">$52,920</p>
I checked the Page Source, but I couldn't find it. The closest thing I found:
<span id="eppraisal_val" class="valuation-estimates-price ajaxload" data-url="/home-values/property_lookup_eppraisal?a=14032%20S%20Atlantic%20Ave&z=60827&propid=59143864">loading...</span>
I tried getElementById("eppraisal_val") and getElementById("eppraisalval"), but neither of those worked.
How can I change my code to get that element?
Here is more from the Inspect Element window, around that code:
<span id="eppraisal_val" class="valuation-estimates-price ajaxload" data-url="/home-values/property_lookup_eppraisal?a=14032%20S%20Atlantic%20Ave&z=60827&propid=59143864"><p id="eppraisalval">$52,920</p><p class="main-page-description-small valuation_details" style="display:none;margin-top:5px">Low: $44,982 <br>High: $60,858</p></span>
Here is the shorter version of code I tried:
Sub macroID()
On Error Resume Next
Dim ie As Object
Set ie = CreateObject("internetexplorer.application")
Dim doc As HTMLDocument
Dim valuation As String
Dim val1 As String, val2 As String
Set doc = ie.Document
ie.Visible = True
ie.navigate "http://www.eppraisal.com/home-values/property/14032-s-atlantic-ave-riverdale-il-60827-59143864/"
Do
DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE
Set doc = ie.Document
valuation = doc.getElementById("eppraisalval")
Cells(1, 4).Value = valuation
Application.Wait (Now + TimeValue("0:00:03"))
End Sub
The issue with scraping from that URL is that there is a verification page beforehand which requires you to confirm you "are not a robot", which makes it hard to scrape anything from it.
If you complete the verification manually first, it may be saved in your cache and then allow you to run macros freely against the website, but you would have to try this out.
In the meantime, the only issue I can see with your code is that you haven't included .innerText after .getElementById("eppraisalval"). The valuation line should look like this:
valuation = doc.getElementById("eppraisalval").innerText
