I have this code, which copies the content of a website to Excel:
Sub HTML_Table_To_Excel()
    Dim htm As Object
    Dim Tr As Object
    Dim Td As Object
    Dim Tab1 As Object
    Web_URL = "https://www.fxstreet.com/economic-calendar"
    'Create HTMLFile Object
    Set HTML_Content = CreateObject("htmlfile")
    'Get the WebPage Content to HTMLFile Object
    With CreateObject("msxml2.xmlhttp")
        .Open "GET", Web_URL, False
        .send
        HTML_Content.body.innerHTML = .responseText 'this is the highlighted part for the error
    End With
    Column_Num_To_Start = 1
    iRow = 1
    iCol = 1
    iTable = 1
    'Loop Through Each Table and Download it to Excel in Proper Format
    For Each Tab1 In HTML_Content.getElementsByTagName("table")
        With HTML_Content.getElementsByTagName("table")(iTable)
            For Each Tr In .Rows
                For Each Td In Tr.Cells
                    Worksheets("Sheet1").Cells(iRow, iCol).Select
                    Worksheets("Sheet1").Cells(iRow, iCol) = Td.innerText
                    iCol = iCol + 1
                Next Td
                iCol = Column_Num_To_Start
                iRow = iRow + 1
            Next Tr
        End With
    Next Tab1
    MsgBox "Process Completed"
End Sub
This code used to work well, but now it fails with the message "Run-time error '429': ActiveX component can't create object". If I then select Debug, this part of the code is highlighted:
Set HTML_Content = CreateObject("htmlfile")
What's the problem?
So this isn't going to be a nice "here you go" answer. At least not at present.
An XMLHTTP request will not work here. The page is loaded dynamically, and the request returns before the content you want has been added to the page, so that content simply isn't in the response.
You should always use Option Explicit at the top of your module. You have numerous undeclared variables (e.g. Web_URL, which should be declared As String) that are created on the fly as Variants. As a result, you are not catching what looks like a typo or inconsistent variable naming: Dim htm As Object, but then Set HTML_Content = CreateObject("htmlfile").
I suspect that you wanted Set htm = CreateObject("htmlfile") . This would be an object type consistent with your existing naming and declaration. HTML_Content would be a string, and would not require the Set keyword which assigns an object reference. Here, I think you wanted HTML_Content = .responseText, but in fact it is safer to ensure any returned string is unencoded with HTML_Content = StrConv(.responseBody, vbUnicode)
There is only one table of interest, as far as I can see, and it has an id. If you were to continue with your method then after this line: For Each Tab1 In HTML_Content.getElementsByTagName("table"), each Tab1 would be an HTMLTable object so you wouldn't need the next line with iTable variable. You are using a For Each Loop so are already iterating the parent collection. Simply, grab the table by its id: With .document.getElementById("fxst-calendartable"). id is the fastest retrieval method available so should be preferred over all other methods when available.
The table is either poorly designed or deliberately designed to make scraping difficult. You can neither copy the object outerHTML to clipboard and paste the table to Excel, nor simply loop table rows and table cells to get all the content displayed. And the content that is displayed is not in a nice legible fashion. You may have some luck with web queries but in my experience Javascript heavy pages like this don't mix well with data from web queries.
So, with those points in mind:
You are going to likely need to use a browser to navigate and ensure page content has loaded e.g. While ie.Busy Or ie.readyState < 4: DoEvents: Wend, possibly with some extra explicit wait time
It is also likely you will need to inspect the html of the table closely to write something fit for purpose to grab the elements you are interested in, and write them out in a coherent fashion to the page. For example, it is clear you will need to consider the role and placement of div elements within the table for correct alignment of data in your output.
You could simply dump the above approach and screen shot the timetable or print it out. Consider whether the effort of coding this is worth it over simply printing the page once a week (unless page is frequently updated).
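As a rough sketch of the browser-based route described above (not a complete solution, since the nested div elements in the table still need handling): it assumes a late-bound InternetExplorer instance, that the table keeps the id fxst-calendartable, that output goes to Sheet1, and a hypothetical 20-second cap on waiting for the table to appear.

```vba
Option Explicit

Public Sub GetCalendarTableSketch()
    Const MAX_WAIT_SEC As Long = 20            ' hypothetical cap; tune as needed
    Dim ie As Object, tbl As Object
    Dim tr As Object, td As Object
    Dim r As Long, c As Long, t As Single

    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = True
    ie.Navigate2 "https://www.fxstreet.com/economic-calendar"

    ' Wait for the initial page load
    While ie.Busy Or ie.readyState < 4: DoEvents: Wend

    ' The table is filled in by script, so poll for it by id
    t = Timer
    Do
        DoEvents
        Set tbl = ie.document.getElementById("fxst-calendartable")
        If Timer - t > MAX_WAIT_SEC Then Exit Do
    Loop While tbl Is Nothing

    If tbl Is Nothing Then
        MsgBox "Table not found within " & MAX_WAIT_SEC & " seconds"
    Else
        ' Naive dump: expect misaligned output until the div layout is handled
        For Each tr In tbl.Rows
            r = r + 1: c = 0
            For Each td In tr.Cells
                c = c + 1
                ThisWorkbook.Worksheets("Sheet1").Cells(r, c).Value = td.innerText
            Next td
        Next tr
    End If
    ie.Quit
End Sub
```

Timer resets at midnight, so a run that spans midnight would need a different clock.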
Related
I am experimenting with web automation and struggling a bit trying to utilize a drop down list.
My code works up to the point of searching for a company name and hitting "go". On the new page I can't seem to find the right code that selects the group of elements that represents the drop down list. I then want to select "100" entries, but I can't even grab the nodes that represent this list.
I have been browsing multiple pages on stackoverflow that talk about CSS selectors, and looked at tutorials, but that doesn't seem to help either. I either end up grabbing nothing, or whatever I grab doesn't support the getElementsByTagName method; ultimately I am trying to drill down into the td and select nodes. Not sure what to do with those yet, but I can't even grab them. Thoughts?
(note stopline is just a line that I use a breakpoint on to stop my code)
CSS helper website: https://www.w3schools.com/cssref/trysel.asp
Code:
Option Explicit
Sub test()
    On Error GoTo ErrHandle
    Dim ie As New InternetExplorer
    Dim doc As New HTMLDocument
    Dim ws As Worksheet
    Dim stopLine As Integer
    Dim oSearch As Object, oSearchButton As Object
    Dim oForm As Object
    Dim oSelect As Object
    Dim list As Object
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    ie.Visible = True
    ie.navigate "https://www.sec.gov/edgar/searchedgar/companysearch.html"
    Do
        DoEvents
    Loop Until ie.readyState = READYSTATE_COMPLETE
    Set doc = ie.Document
    Set oSearch = doc.getElementById("companysearchform")
    Set oSearchButton = oSearch.getElementsByTagName("input")(1)
    Set oSearch = oSearch.getElementsByTagName("input")(0)
    oSearch.Value = "Summit Midstream Partners, LP"
    oSearchButton.Click
    Do
        DoEvents
    Loop Until ie.readyState = READYSTATE_COMPLETE
    Set doc = ie.Document
    Set list = doc.querySelectorAll("td select")
    stopLine = 1
    Exit Sub
ErrHandle:
    MsgBox Err.Number & " - " & Err.Description, vbCritical
    Exit Sub
End Sub
td select matches a single node here, so you only need querySelector. The node has an id, so you might as well use the quicker querySelector("#count") to target the parent select. To change the option you can then set selectedIndex on the parent select, or target the child option by its value attribute: querySelector("[value='100']").Selected = True. You may then need to attach and trigger a change/onchange HTML event on the parent select to register the change.
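A minimal sketch of that select-then-fire-the-event sequence. It assumes ie already shows the results page, that the select still has the id count, and that the page renders in a document mode where createEvent is available (IE9 or later); the procedure name is hypothetical.

```vba
Option Explicit

' ie is assumed to be an InternetExplorer instance already on the results page
Public Sub SelectHundredEntries(ByVal ie As Object)
    Dim sel As Object, evt As Object

    Set sel = ie.document.querySelector("#count")
    If sel Is Nothing Then Exit Sub            ' select not present (yet)

    ' Pick the option whose value attribute is 100
    sel.querySelector("[value='100']").Selected = True

    ' Tell the page the selection changed
    Set evt = ie.document.createEvent("HTMLEvents")
    evt.initEvent "change", True, False
    sel.dispatchEvent evt
End Sub
```

Whether the page actually reacts to the change event depends on how its handlers are attached, so treat this as a starting point.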
However, I would simply extract the company CIK from the current page, concatenate it and the count=100 parameter into the URL, and .Navigate2 to that, using the following format:
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001549922&type=&dateb=&owner=include&count=100&search_text=
You can extract the CIK, after clicking the initial company search and waiting for the page to load, with:
Dim cik As String
cik = ie.document.querySelector("[name=CIK]").value
ie.Navigate2 "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" & cik & "&type=&dateb=&owner=include&count=100&search_text="
Given several params are left blank you can likely shorten to:
"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" & cik & "&owner=include&count=100"
If you are unable to get the initial parent select you probably need a timed loop waiting for that element to be present after clicking the search button. An example is shown here in a StackOverflow answer.
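A timed loop of that kind could look roughly like the following. The function name and the 10-second default are assumptions for illustration, not taken from the linked answer.

```vba
Option Explicit

' Returns the first node matching cssSelector, or Nothing on timeout.
Public Function WaitForElement(ByVal ie As Object, ByVal cssSelector As String, _
                               Optional ByVal maxWaitSec As Single = 10) As Object
    Dim t As Single, ele As Object
    t = Timer
    Do
        DoEvents
        On Error Resume Next                   ' document may be mid-navigation
        Set ele = ie.document.querySelector(cssSelector)
        On Error GoTo 0
        If Timer - t > maxWaitSec Then Exit Do
    Loop While ele Is Nothing
    Set WaitForElement = ele
End Function
```

Usage would be along the lines of Set oSelect = WaitForElement(ie, "#count"), followed by a check for Nothing before using the result.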
I'm trying to extract the stats table from this page into Excel so that I can refresh it automatically. I tried navigating through the Get External Data process, but where the DIV with the table should be, it shows as null instead of as a Table element. I suspect it's how the site is coded: looking at the source code, that table is generated from a script, even though it appears as a standard table element in the actual display:
So I tried some macros for this purpose, such as the following:
Sub Extract_data()
    Dim url As String, links_count As Integer
    Dim i As Integer, j As Integer, row As Integer
    Dim XMLHTTP As Object, html As Object
    Dim tr_coll As Object, tr As Object
    Dim td_coll As Object, td As Object
    links_count = 39
    For i = 0 To links_count
        url = "http://www.admision.unmsm.edu.pe/res20130914/A/011/" & i & ".html"
        Set XMLHTTP = CreateObject("MSXML2.XMLHTTP")
        XMLHTTP.Open "GET", url, False
        XMLHTTP.send
        Set html = CreateObject("htmlfile")
        html.body.innerHTML = XMLHTTP.ResponseText
        Set tbl = html.getelementsbytagname("Table")
        Set tr_coll = tbl(0).getelementsbytagname("TR")
        For Each tr In tr_coll
            j = 1
            Set td_col = tr.getelementsbytagname("TD")
            For Each td In td_col
                Cells(row + 1, j).Value = td.innerText
                j = j + 1
            Next
            row = row + 1
        Next
    Next
End Sub
However, that only pulls the same tables I already have access to with Get External Data. It doesn't seem to 'see' the table on the page. Is there a way to write the VBA so that it pulls the actual table shown on the page, instead of just checking the page source code?
On a sidenote, you can see from the screenshot code above that there is a CSV export link on the page. However, because this is generated with JS, it just shows up as a huge string of characters beginning with data:application and not an actual link that can be refreshed (and I suspect it could not be anyway since the characters likely change as the table parameters do). It does have a download attribute attached for the filename, is there a way to work backwards from that attribute to get Excel to find the file?
I'll take any method I can get. Thanks!
I am trying to scrape IPO date from crunchbase.
Unfortunately I get Runtime Error 1004 “Application-defined or Object-defined error”.
My goal is to save IPO date in A1 cell.
Sub GetIE()
    Dim IE As Object
    Dim URL As String
    Dim myValue As IHTMLElement
    URL = "https://www.crunchbase.com/organization/verastem"
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True
    IE.Navigate URL
    Do While IE.Busy Or IE.ReadyState <> 4
        DoEvents
    Loop
    Set myValue = IE.Document.getElementsByClassName("post_glass post_micro_glass")(0)
    Range("A1").Value = myValue
    Set IE = Nothing
End Sub
I can't find that class name in the html for that url. You can use the css selector I show below instead; the element it targets is present in the xmlhttp response, so you can avoid opening a browser.
Option Explicit
Public Sub GetDate()
    Dim html As HTMLDocument
    Set html = New HTMLDocument '< VBE > Tools > References > Microsoft HTML Object Library
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.crunchbase.com/organization/verastem#section-overview", False
        .send
        html.body.innerHTML = .responseText
    End With
    ActiveSheet.Range("A1") = html.querySelectorAll(".field-type-date.ng-star-inserted").item(1).innerText
End Sub
If you don't want to use compound classes then you can also use
ActiveSheet.Range("A1") = html.querySelectorAll("#section-ipo-stock-price .field-type-date").item(1).innerText
You can see the relevant html here:
Note the element has multiple (compound) classes
<span class="component--field-formatter field-type-date ng-star-inserted" title="Jan 27, 2012">Jan 27, 2012</span>
There are 3 classes: component--field-formatter, field-type-date and ng-star-inserted. I use two of these in combination in the first solution I give. Using multiple classes is popular nowadays because of the versatility it gives in page styling, e.g. it allows overriding styles easily. You can read about css specificity* to understand this better.
Relying on more classes may make the code a little less robust, as the ordering of classes may change and one or more classes may be removed. This was raised by @SIM in a comment on an answer to another web-scraping question. Thus, I offer one solution that uses two of the classes, and another that uses only one.
Whilst you do get the same date for this page with simply:
ActiveSheet.Range("A1") = html.querySelector("#section-ipo-stock-price .field-type-date").innerText
I wouldn't want to assume that would always hold true as it grabs the date from the line where it says "Their stock opened".
* https://developer.mozilla.org/en-US/docs/Web/CSS/Specificity
Morning,
I am having trouble with a webscrape from Excel, whereby getelementsbyclassname is failing to act on some objects, throwing up the "Object doesn't support this property or method" error.
The problem appears when the object I am feeding into getelementsbyclassname is itself the result of a getelementsbyclassname method. I am not sure why, particularly as I can get the class name when acting on a larger object...
Here is a code extract
''''Boring Variables Declaration I've cut out''''
'Initialise IE
Dim IEApp As New InternetExplorer
Set IEApp = New InternetExplorer
IEApp.Visible = True 'JB
'Open page and wait for page to load
IEApp.navigate ("http://www.anicewebsite.com")
Do Until IEApp.readyState = READYSTATE_COMPLETE And IEApp.Busy = False
    DoEvents
Loop
Set HTMLdoc = IEApp.document
Set RefLocation = Sheets("INFO_DUMP").Range("LocationRefCell")
Set trElements = HTMLdoc.getElementsByClassName("basic-details")
For Each trElement In trElements
    'Select the LHS box and extract info
    Set tdElement = trElement.getElementsByClassName("tieredToggle")
    'write start/end locations
    '''''THIS NEXT LINE THROWS AN ERROR'''''
    Data_str = tdElement.getElementsByClassName("title").innerText
    '''''AS DOES'''''
    MyObject=tdElement.getElementsByClassName("title")
    RefLocation.Offset(1, 2).Value = Data_str
Next 'close tr Loop
However, I can get the 'title' object via
For Each trElement In trElements
    Set MyObject=trElement.getElementsByClassName("title")
Next 'close tr Loop
so the error is, presumably, something about tdElement (a DispHTMLElement Collection), which I tried to attach an image of but I lack the reputation (see link at end of post)...
Many thanks for any help.
PS. the webpage is structured, roughly, with a 2-column table whose rows I isolate with "basic-details". The first column is the "tiered toggle" and then the items I want are inner text in eg. "title". I need to use tieredtoggle as objects in each column have repeated class names
http://i.stack.imgur.com/1tyb6.png
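getElementsByClassName returns a collection rather than a single element, so you need to index into it before reading innerText. You can use this to get the inner text:
Data_str = tdElement.getElementsByClassName("title")(0).innerText
Replace the 0 in ("title")(0) with the index of the element you want.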
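Going by the question's own description, tdElement is itself a collection (the result of getElementsByClassName), and collections do not expose getElementsByClassName, so it may also need indexing. A minimal sketch, assuming the first "tieredToggle" element in each row is the one wanted and that trElement is one of the "basic-details" rows from the question; the function name is hypothetical:

```vba
Option Explicit

' Returns the text of the first "title" element inside the first
' "tieredToggle" element of a row, or "" if either is missing.
Public Function FirstTitleText(ByVal trElement As Object) As String
    Dim toggles As Object, titles As Object

    Set toggles = trElement.getElementsByClassName("tieredToggle")
    If toggles.Length = 0 Then Exit Function

    ' Index into the collection first, then search inside that element
    Set titles = toggles(0).getElementsByClassName("title")
    If titles.Length = 0 Then Exit Function

    FirstTitleText = titles(0).innerText
End Function
```

Inside the loop this would be called as Data_str = FirstTitleText(trElement).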
There are many online resources that illustrate using Microsoft Internet Explorer Controls within VBA Excel to perform basic IE automation tasks. These work when the webpage has a basic construct. However, when webpages contain multiple frames they can be difficult to work with.
I need to determine if an individual frame within a webpage has completely loaded. For example, this VBA Excel code opens IE, loads a webpage, loops thru an Excel sheet placing data into the webpage fields, executes search, and then returns the IE results data to Excel (my apologies for omitting the site address).
The target webpage contains two frames:
1) The searchbar.asp frame for search value input and executing search
2) The searchresults.asp frame for displaying search results
In this construct the search bar is static, while the search results change according to input criteria. Because the webpage is built in this manner, the IEApp.ReadyState and IEApp.Busy cannot be used to determine IEfr1 frame load completion, as these properties do not change after the initial search.asp load. Therefore, I use a large static wait time to avoid runtime errors as internet traffic fluctuates. This code does work, but is slow. Note the 10 second wait after the cmdGO statement. I would like to improve the performance by adding solid logic to determine the frame load progress.
How do I determine if an autonomous frame has finished loading?
' NOTE: you must add a VBA project reference to "Microsoft Internet Controls"
' in order for this code to work
Dim IEapp As Object
Dim IEfr0 As Object
Dim IEfr1 As Object
' Set new IE instance
Set IEapp = New InternetExplorer
' With IE object
With IEapp
    ' Make visible on desktop
    .Visible = True
    ' Load target webpage
    .Navigate "http://www.MyTargetWebpage.com/search.asp"
    ' Loop until IE finishes loading
    While .ReadyState <> READYSTATE_COMPLETE
        DoEvents
    Wend
End With
' Set the searchbar.asp frame0
Set IEfr0 = IEapp.Document.frames(0).Document
' For each row in my worksheet
For i = 1 To 9999
    ' Input search values into IEfr0 (frame0)
    IEfr0.getElementById("SearchVal1").Value = Cells(i, 5)
    IEfr0.getElementById("SearchVal2").Value = Cells(i, 6)
    ' Execute search
    IEfr0.all("cmdGo").Click
    ' Wait a fixed 10sec
    Application.Wait (Now() + TimeValue("00:00:10"))
    ' Set the searchresults.asp frame1
    Set IEfr1 = IEapp.Document.frames(1).Document
    ' Retrieve webpage results data
    Cells(i, 7) = Trim(IEfr1.all.Item(26).innerText)
    Cells(i, 8) = Trim(IEfr1.all.Item(35).innerText)
Next
As @JimmyPena said, it's a lot easier to help if we can see the URL.
If we can't, hopefully this overview can put you in the right direction:
Wait for page to load (IEApp.ReadyState and IEApp.Busy)
Get the document object from the IE object. (done)
Loop until the document object is not nothing.
Get the frame object from the document object.
Loop until the frame object is not nothing.
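The steps above can be sketched as a small helper. It additionally assumes that the frame's document reports a readyState of "complete" once it has loaded; the function name and the 30-second default are hypothetical.

```vba
Option Explicit

' Waits until frame number frameIndex has a document that reports "complete".
' Returns False on timeout.
Public Function WaitForFrame(ByVal ie As Object, ByVal frameIndex As Long, _
                             Optional ByVal maxWaitSec As Single = 30) As Boolean
    Dim t As Single, frameDoc As Object
    t = Timer
    Do
        DoEvents
        Set frameDoc = Nothing
        On Error Resume Next                   ' frame or document may not exist yet
        Set frameDoc = ie.document.frames(frameIndex).document
        On Error GoTo 0
        If Not frameDoc Is Nothing Then
            If frameDoc.readyState = "complete" Then
                WaitForFrame = True
                Exit Function
            End If
        End If
    Loop While Timer - t <= maxWaitSec
End Function
```

One caveat: straight after a click the old frame document may still report "complete", so a short pause, or a check that the content has actually changed, may be needed before calling this.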
Hope this helps!
I used a loop that keeps setting the field value until it is populated, like this:
Do While IE.Document.getElementById("USERID").Value <> "test3"
    IE.Document.getElementById("USERID").Value = "test3"
Loop
this is a Rrrreeally old thread, but I figured I would post my findings, because I came here looking for an answer...
Looking in the Locals window, I could see that the "readystate" value was only "READYSTATE_COMPLETE" for the IE app itself, but for the iframe it was lowercase "complete". (The browser object's ReadyState is a numeric enum, while a document's readyState is a string.)
So I explored this by using a debug.print loop on the .readystate of the frame I was working with.
Dim IE As Object ' assumed to already reference a running InternetExplorer instance
Dim doc As MSHTML.HTMLDocument
Set doc = IE.Document
Dim iframeDoc As MSHTML.HTMLDocument
Set iframeDoc = doc.Frames("TheFrameIwasWaitingFor").Document
' then, after I had filled in the form and fired the submit event,
Debug.Print iframeDoc.readyState
Do Until iframeDoc.readyState = "complete"
    Debug.Print iframeDoc.readyState
    DoEvents
Loop
So this will show you line after line of "loading" in the immediate window, eventually showing "complete" and ending the loop. it can be abridged to remove the debug.prints of course.
another thing:
debug.print iframeDoc.readystate ' is the same as...
debug.print doc.frames("TheFrameIwasWaitingFor").Document.readystate
' however, you can't use...
IE.Document.frames("TheFrameIwasWaitingFor").Document.readystate ' for some reason...
forgive me if all of this is common knowledge. I really only picked up VBA scripting a couple days ago...