I spent several hours researching this and couldn't find a solution.
I'm trying to do a relatively simple VBA web scrape on Google. I'm currently hung up on inputting text into Google's search box. There are three different routes I have seen, all of which cause errors.
One route is the following code
Dim aEXPLORER as InternetExplorer
Set aEXPLORER = New InternetExplorer
Do While aEXPLORER.readyState <> READYSTATE_COMPLETE
Application.StatusBar = "Loading google..."
DoEvents
Loop
At this point, the first route goes like this:
aEXPLORER.Document.Getelementbyid("idhere").value
The problem is after .document my intellisense stops working. If I continue to write the code anyway it errors out.
Route two - add at the top dim aHTML as HTMLDocument
Set aHTML = aEXPLORER.document
aHTML.getElementById("idhere").value
This is marginally more successful as I get to the getelementbyID tag but my intellisense doesn't find a .value anywhere. I have a lot of other options like innertext, etc. But no .value like several people use in tutorials online.
I looked in my object libraries and I don't see a .value in my HTMLDocument library or my InternetExplorer library or my Excel library.
The problem has to be something with declaring a variable or an object or something of that nature. Why don't I ever get to a .value? This first method worked online for three different people showing tutorials, why isn't it working when I do it? How are they getting a Getelementbyid tag without setting up an HTML document at the top? For reference, this popular tutorial uses method one, "http://analystcave.com/web-scraping-tutorial/" about 75% of the way through the tutorial.
Thank you so much for being such helpful people. I'm at the end of my currently very limited rope.
Answer credit goes to Berco!
Problem 1: I was using the wrong ID identifier. Solution 1: Copy and Paste HTML code for each interaction point into VBA, comment it out, and copy/paste the identifier directly from that. This has several other benefits as well.
Problem 2: I was concerned with losing intellisense at .value. Solution 2: Not having intellisense does not mean the code won't work. It just means the object isn't declared with a specific type via Dim/Set. To get intellisense, you have to Dim that item as an HTMLInputElement, which requires the Microsoft HTML Object Library reference to be checked. However, you may still use .value etc. even without intellisense as long as the HTML Object Library reference is checked.
The ID of Google's search bar is "lst-ib" (id="lst-ib"). Try something like:
Dim searchbar As Object
Set searchbar = aEXPLORER.Document.getElementById("lst-ib")
searchbar.Value = "I wanna google this"
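Putting the pieces of this thread together, a minimal sketch might look like the following. It uses late binding (so there is no intellisense, but no references are needed either), it adds the Navigate call that the loop in the question depends on, and it treats "lst-ib" purely as the example ID from this answer; Google's markup changes over time, so check the current page source.

```vba
Sub FillSearchBox()
    Dim ie As Object          ' late-bound InternetExplorer
    Dim searchbar As Object   ' late-bound element: no intellisense on .Value

    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = True
    ie.Navigate "https://www.google.com"

    ' Wait until the browser reports the page as loaded (4 = READYSTATE_COMPLETE)
    Do While ie.Busy Or ie.readyState <> 4
        DoEvents
    Loop

    ' "lst-ib" is the example ID from this thread and may no longer exist
    Set searchbar = ie.document.getElementById("lst-ib")
    If searchbar Is Nothing Then
        MsgBox "No element with that ID was found; check the page source."
    Else
        searchbar.Value = "I wanna google this"
    End If
End Sub
```

The Nothing check is there because a wrong ID was the actual cause of the error in this question.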
So I understand how to pull data from a single weblink when it is in tables. I cannot find a single tutorial anywhere on the web about how to do the same with div elements, and no one seems to talk about it at all. Can someone please give me an example or something? Either Excel or Google Spreadsheets.
I'm trying to teach myself how to do this, using the website https://newworldstatus.com/regions/us-east for a small project I want to do.
Thank you in advance.
This is not a comprehensive answer; it is just intended to show you how some very basic concepts work, first in VBA and then in Sheets. Let me preface all of this by saying that while your test URL seems simple enough, you will not be able to do any of this for that specific URL. They are either actively trying to stop scraping or they just have it set up in a way that makes it difficult to scrape by accident. If you make a web request directly to that URL, you will get back the JS code that handles loading the data and not the data itself, so any kind of parsing you try to do will fail, because what you see in the page isn't what actually comes back on the initial page request. All the html that will be in the page is enough to show this:
You would need to either try to read through the code and figure out what they're doing, or do some tinkering in the javascript console, and probably some fairly high-level tinkering. So for a first project, or just to learn some basics, I think I would pick a different test case.
First, in VBA. It's both complicated and not all that complicated at the same time. If you know how web technologies work non-language specifically, then it all works pretty much the same way in VBA. First, you'll need to make a web request. You can do that with the winHTTP library or the msXML library. I usually use winHTTP, but unless what you're doing is complex, either one is fine.
WEB REQUEST:
You'll need to instantiate a request object. You can do that by either adding a reference to the library (tools->references-> and pick the library out of the list) or you can use late binding. I prefer to add the reference, because you get intellisense that way. Here are both:
Dim req As New WinHttp.WinHttpRequest
or
Set req = CreateObject("WinHttp.WinHttpRequest.5.1")
Then you open the request. I'm going to assume this is a straight GET. POST requests get a little more complicated:
req.Open "GET", url, True
If you have the reference added and created the req with Dim, then you'll get intellisense: as you type, the arguments will pop up, and you can use them to refer to the documentation if you have questions. True here sends the request asynchronously, which I would do. If you don't, it will block up the interface. This is the Open method, which you can find in the documentation.
https://learn.microsoft.com/en-us/windows/win32/winhttp/iwinhttprequest-interface
Then use
req.send
req.WaitForResponse
source = req.responseText
to send the request. WaitForResponse is needed only if you send the request asynchronously. The last part is to get the responseText into a variable.
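The request steps above can be collected into one small function. This is only a sketch: the function name FetchSource is made up for illustration, it uses late binding so it runs without adding the reference, and the status check is an addition that the steps above do not mention.

```vba
' Returns the response body for a simple GET, or "" if the status is not 200.
Function FetchSource(ByVal url As String) As String
    Dim req As Object
    Set req = CreateObject("WinHttp.WinHttpRequest.5.1")

    req.Open "GET", url, True     ' True = asynchronous
    req.Send
    req.WaitForResponse           ' only needed because the request is asynchronous

    If req.Status = 200 Then
        FetchSource = req.ResponseText
    End If
End Function
```

Usage would be something like source = FetchSource("https://example.com"), where the URL is a placeholder.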
PARSING:
Then you'll need to do some stuff with the MSHTML library, so add a reference to that. You can also late bind, but I would not, because it will be very helpful to you to have the prompts in intellisense.
First, set up a document
https://learn.microsoft.com/en-us/dotnet/api/mshtml.htmldocument?view=powershellsdk-1.1.0
and write the source you just fetched to it:
Dim doc as new MSHTML.HTMLdocument
doc.write source
Now you have a document object you can manipulate. The trick is to get a reference to the element you want. There are two methods that will return an element:
getElementById
querySelector
If you are lucky, the element you are looking for will have a unique ID and you can just get it. If not so lucky, you can use a selector that identifies it uniquely. In either case, you will set up an IHTMLElement to return to:
Dim el as MSHTML.IHTMLElement
set el = doc.getElementById("uniqueID") 'whatever the unique ID is
Once you have that, you can use the methods and properties of the element to return information about it:
https://learn.microsoft.com/en-us/dotnet/api/mshtml.ihtmlelement?view=powershellsdk-1.1.0
There are more specific interfaces, like
https://developer.mozilla.org/en-US/docs/Web/API/HTMLAnchorElement
You can use the generic IHTMLElement, but sometimes there are advantages to using a specific element type, for instance, the properties that are available to it.
Sometimes you will have to set up an IHTMLElementCollection:
https://learn.microsoft.com/en-us/previous-versions/windows/internet-explorer/ie-developer/platform-apis/aa703928(v=vs.85)
and iterate it to find the specific element you are looking for. There are four methods that return collections:
getElementsByName
getElementsByTagName
getElementsByClassName
querySelectorAll
getElementsByClassName is sometimes problematic, so forewarned is forearmed.
If you need to do that, set up an IHTMLElementCollection and return the list to that:
Dim els As MSHTML.IHTMLElementCollection
Set els = doc.getElementsByTagName("tagName") 'for instance a for anchors, div for divs
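As a rough sketch of iterating such a collection, the routine below lists the class name and the start of the text of every div in a piece of HTML. It assumes the MSHTML reference is added. The name ListDivs is made up, and it loads the markup through body.innerHTML, which is a commonly used alternative to doc.write rather than the approach shown above.

```vba
' Requires a reference to "Microsoft HTML Object Library" (MSHTML).
Sub ListDivs(ByVal source As String)
    Dim doc As New MSHTML.HTMLDocument
    Dim els As MSHTML.IHTMLElementCollection
    Dim el As MSHTML.IHTMLElement

    doc.body.innerHTML = source               ' swapped in for doc.write

    Set els = doc.getElementsByTagName("div")
    For Each el In els
        ' Print the class and the first 40 characters of the visible text
        Debug.Print el.className, Left$(el.innerText, 40)
    Next el
End Sub
```

Calling ListDivs with the responseText of a request prints one line per div to the Immediate window, which is usually enough to spot the element you are after.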
That is about it. There is obviously more to it, but a comprehensive answer would be very long. This is mostly intended to point you in the right direction and give you more stuff to google.
I will say that you should test out some of these methods in the browser first. They exist in many languages, and all major browsers have developer tools. For Chrome, for instance, press Ctrl+Shift+I to bring up the dev tools, and then in the console window type something like:
document.getElementById("uniqueID")
and you should get the node. Or
document.getElementsByClassName("test") // where test is the name of the class, without a leading dot
document.querySelectorAll("div") // where you pass a valid CSS selector
and you will get the node list.
It will be quicker to experiment there than to try to set it up and debug in VBA. Once you have a good handle on how it works, try to transfer that knowledge to a VBA implementation.
Here is a basic overview of .querySelector to get you started on understanding how those work, although they can get very complicated. In fact, querySelector is my go to method for finding elements.
https://www.w3schools.com/jsref/met_document_queryselector.asp
Now, Google Sheets:
You don't really want to use IMPORTHTML, even though that seems counterintuitive. That function (AFAIK) only supports tables and lists, and it's index based, too, which means you give it a number n and it returns the nth table or list in the page. That means that if they ever change the layout, or the layouts are dynamic in any way, then you won't be able to rely on an index to accurately identify what you want. Also, as you noted, people don't really use tables much anymore, and when they say list I'm pretty sure they mean <ol> and <ul> elements, which is also not going to be that useful to you. Here are the docs:
https://support.google.com/docs/table/25273?hl=en&visit_id=637732757707317357-1855795725&rd=2
But you can use IMPORTXML. Even though it says XML, you can still use it to parse HTML (for reasons and with limitations that are out of scope for this answer). IMPORTXML takes a URL and an xpath selector. In this way it's similar to the document.querySelector and querySelectorAll methods. Here is some information on xpath in a tutorial from w3schools.
https://www.w3schools.com/xml/xpath_intro.asp
And if you want to test selectors in Chrome you can use $x("selector") in the javascript console in the dev tools. I believe Firefox also supports this, but I am not sure if other browsers do. If not, you can use document.evaluate:
https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate
Even though you can't actually use this in sheets against the URL you've given, let's take a look at a couple of xpath selectors in that context. Hit Ctrl+Shift+I to bring up the dev tools (hopefully you are using Chrome), and then go to the elements tab. If you don't have the javascript console showing in the bottom pane, hit Esc. You should see something like this:
Use the arrow icon in the top left of the dev tools to search the elements, and just click on the first row in the table:
so that you can see the structure of the elements, and figure out how to parse out what you want from it. You'll notice that the cell that's highlighted is contained in a div with a role of "row" and an attribute of row-id. I think that's where I would start. So an xpath to that container would look something like this:
//div[@row-id=1]
where we are fetching all elements (//) that match div and have an attribute (@) of row-id = 1.
If you want to get the children of that container, you just add another level to the path
//div[@row-id=1]/div
where we want to get all children (/) that are divs.
And I notice that they all have a col-id attribute, so if you wanted to fetch the "set" information you'd just specify divs that have an attribute of col-id = 'set':
//div[@row-id=1]/div[@col-id='set']
and to get the text out of that:
//div[@row-id=1]/div[@col-id='set']/text()[1]
since it looks like the second node is the one that has the team name in it. Again, you can see how this WOULD work in the dev tools, but you won't actually be able to use this for your URL.
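If you would rather experiment with this kind of xpath from VBA, a small sketch using the MSXML DOM is below. Note that an attribute test in xpath is written with @. The markup is invented for illustration (it only mimics the row-id / col-id structure described above), and MSXML needs well-formed XML, so this will not work on arbitrary HTML.

```vba
Sub XPathDemo()
    Dim xml As Object, nodes As Object, node As Object
    Set xml = CreateObject("MSXML2.DOMDocument.6.0")

    ' Hypothetical, well-formed sample that mimics the grid structure
    xml.LoadXML "<root>" & _
        "<div row-id=""1""><div col-id=""set"">Alpha</div><div col-id=""pop"">10</div></div>" & _
        "<div row-id=""2""><div col-id=""set"">Beta</div><div col-id=""pop"">20</div></div>" & _
        "</root>"

    ' Same shape as the selectors above: a row container, then a child cell
    Set nodes = xml.SelectNodes("//div[@row-id='1']/div[@col-id='set']")
    For Each node In nodes
        Debug.Print node.Text     ' prints Alpha
    Next node
End Sub
```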
I'm not going to spend a lot of time here. As already stated, you won't be able to use this method on your specific URL. If you can figure out the actual URL that your URL wraps around, then perhaps. Also, since there's only one argument, the selector, then there's not much more to expound on. If you needed something more complex, like the ability to iterate over a set of matching nodes, you could probably do it in Scripts, but I would probably just switch to Excel if it started getting that complicated. The only exception would be if the data was JSON formatted, in which case Scripts will be able to handle that better than VBA, although I would probably switch to a different language entirely in that case.
Since your URL is probably not good for testing, I'm going to point you to this tutorial from Geckoboard, which has a few different examples from sites like Wikipedia and Pinterest.
https://www.geckoboard.com/blog/use-google-sheets-importxml-function-to-display-data/
So google around, experiment, and let me know if you need any help. And this was all off the top of my head, so let me know if any of this stuff throws errors so I can edit the answer.
Also, be aware that Excel is not always the right tool for dealing with this. Very often, while the page might have the elements you are looking for, they will be loaded in with JSON and both php and javascript can natively handle JSON objects, while VBA doesn't. If the data is JSON formatted, it is much easier to parse it out of that than trying to parse it out of the DOM structure (DOM = document object model, another thing to google). Also, in many cases, if the data is loaded in with AJAX, it won't be returned with your winHTTP call, because that doesn't execute any javascript that might be in the page.
Further, in many cases you will need to set headers or cookies in the winHTTP call to get the data (calls without the right settings might return an error or a redirect). That is also not addressed in my answer, although you can set headers and cookies in winHTTP. You would need to sniff the calls, either with Fiddler or similar or with the network tab in dev tools, to find out the right combination of information to pass with your request.
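For what it's worth, setting headers on a winHTTP request is done with SetRequestHeader between Open and Send. The sketch below is only an outline: the function name is made up, and the header values are placeholders that you would replace with whatever the sniffed browser request actually sends.

```vba
Function FetchWithHeaders(ByVal url As String) As String
    Dim req As Object
    Set req = CreateObject("WinHttp.WinHttpRequest.5.1")

    req.Open "GET", url, False    ' synchronous, to keep the sketch short

    ' Headers must be set after Open and before Send.
    ' All values below are placeholders, not known-good settings.
    req.SetRequestHeader "User-Agent", "Mozilla/5.0"
    req.SetRequestHeader "Accept", "text/html"
    req.SetRequestHeader "Cookie", "name=value"

    req.Send
    FetchWithHeaders = req.ResponseText
End Function
```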
Hi! I searched google already but found only 1 page mentioning how to do it in MS Access but not in MS Excel, here:
List an MS Access form's controls with their events.
I know how to list the controls on a userform. What I want to know is how to get to the list of events available to a control in the code editor (just clarifying).
Is there a way to programmatically list all available event procedures for a control on a userform like a command button, so that I can add that list of events to an array/collection/dictionary for other uses.
I can't find the control.properties collection mentioned in the link referred to above, since I am working in Excel, where I guess this property is not exposed.
I wish I could go through a collection of properties/events like "for each oneEvent in oneControl.Events" and I know that it is not possible.
I think there must be an internal list/collection like that because we can go through Object Browser in VBA Editor or write event handler code in the VBA Editor.
Is there a way to access that list/collection (may be through VBIDE.VBProject)?
Many thanks in advance.
Hurrah!
I finally found the answer myself! Well, with a little help from @Tim Williams.
It took me 5 days to research the topic of TLI.
Even when my country, Myanmar, was under siege by an unlawful and awful military coup (on 01FEB2021), I couldn't stop thinking about this TLI issue I am facing, despite having to live through anguish, anger, pain, uncertainty and the feeling of being violated.
Now, I got it solved.
That said, I still don't really understand how the whole TLI thing works.
I think it would take me several years of reading on the subject matter to fully understand its declarations, mechanism, functions and methods.
Anyway, I will share what I found out through reading a lot of webpages and finally getting to the answer using a simple watch window to figure out how to get to the list of event procedure names that are available to a given userform control in VBA.
The following sub was taken from another Stack Overflow page about listing the properties of a userform object. I got stuck with it at first because I didn't fully understand how the returned data is structured, but I overcame that by inspecting the structure in the Watch window of the VBA Editor.
Requirement: Reference to TypeLib Information library at C:\Windows\SysWow64\TLBINF32.DLL
Sub listControlEventNames(ByVal o As Object)
    Dim t As TLI.TLIApplication
    Set t = New TLI.TLIApplication
    Dim ti As TLI.TypeInfo
    Set ti = t.ClassInfoFromObject(o)
    Dim i As Integer
    Dim mi As TLI.MemberInfo
    'Write one event name per row into column A of the active sheet
    For Each mi In ti.DefaultEventInterface.Members
        i = i + 1
        ActiveSheet.Cells(i, 1).Value = mi.Name
    Next
End Sub
This sub should be called with a userform control name like:
Call listControlEventNames(UserForm1.ToggleButton1)
or
listControlEventNames Userform1.ToggleButton1
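Since the original goal was to get the names into an array, collection or dictionary, here is a variation that returns a Collection instead of writing to the sheet. It is a sketch that relies on exactly the same TLI calls as the sub above; the function name GetControlEventNames is made up.

```vba
' Requires the same reference to the TypeLib Information library (TLBINF32.DLL).
Function GetControlEventNames(ByVal o As Object) As Collection
    Dim t As TLI.TLIApplication
    Dim ti As TLI.TypeInfo
    Dim mi As TLI.MemberInfo
    Dim names As New Collection

    Set t = New TLI.TLIApplication
    Set ti = t.ClassInfoFromObject(o)

    ' Collect the member names of the control's default event interface
    For Each mi In ti.DefaultEventInterface.Members
        names.Add mi.Name
    Next mi

    Set GetControlEventNames = names
End Function
```

It would be called as Set evts = GetControlEventNames(UserForm1.ToggleButton1), after which evts can be looped over with For Each.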
Finally, I can swear to God (but I'm a Free Thinker) that I can't really find how to list UserForm Control Events anywhere on the web, let alone a user manual on this library. The many many example code snippets, explanations and discussions that I found on TLI, were, for listing properties of UserForm Controls only.
Anyway, enjoy!
I have an Excel-model used for extracting data via S&P Capital IQ Excel-plugin (COM-Addin). This model has been used for a few years without any issues, but now I am getting an error on this line of code:
Application.CommandBars.FindControl(Tag:="menurefreshdatacell").Execute
I have tried searching around and have found that the CommandBars may have been replaced by the Microsoft Office Fluent user interface. However, I cannot seem to find a solution or example of how this may have changed the piece of code from above.
It has been a while since I last used VBA, so unfortunately I have not been able to solve the issue myself. I hope to find some more competent people in here.
Below is the entire piece of code, which selects a designated area and presses a "Refresh Selection" button in the plugin.
Let me know if I have left out anything, thank you very much in advance!
Sub Update_FactSet_formulas()
Application.ScreenUpdating = False
Sheets("Peer group_segments").Select
Range("PG_seg_data").Select
Application.CommandBars.FindControl(Tag:="menurefreshdatacell").Execute
Below is the code provided by the plugin author. It has not changed for years, and it is what the code above should reflect:
Public Sub RefreshSelection()
    Dim Refreshbutton As CommandBarButton
    Set Refreshbutton = Application.CommandBars.FindControl(Tag:="menurefreshdatacell")
    Refreshbutton.Execute
End Sub
I found the following worked for me:
Application.Run("CIQKEYS_RefreshSelection")
The function comes from the ciqfunctions library, but you don't have to add the reference for it to work. However, you can add the library by doing the following:
Tools
References
Check ciqfunctions
click OK
To see all of the functions available in the library:
Go to View
Select Object Browser
From the dropdown menu in the top left corner that likely says "<All Libraries>", select ciqfunctions
Under Classes, select IUdf
The functions should appear in the "Members" column
Again, note that for the code to work for me, I didn't need to have the library added to references, I just wanted to outline how to explore the library, should you choose to.
There are also the following refresh functions:
CIQKEYS_RefreshWorkBook
CIQKEYS_RefreshWorkBookRangeV
CIQKEYS_RefreshWorkSheet
CIQKEYS_RefreshWorkSheetRangeV
Hope this works for you!
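Applied to the macro from the question, the change might look like the sketch below. It assumes the Capital IQ plugin is loaded so that the macro name resolves; the sheet and range names are the ones from the question, and restoring ScreenUpdating at the end is an addition.

```vba
Sub Update_FactSet_formulas()
    Application.ScreenUpdating = False

    Sheets("Peer group_segments").Select
    Range("PG_seg_data").Select

    ' Replaces the CommandBars.FindControl(...).Execute call
    Application.Run "CIQKEYS_RefreshSelection"

    Application.ScreenUpdating = True
End Sub
```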
I have a small problem: I need to loop through Excel rows.
However, the code breaks. If I run it step by step by pressing F8, it is OK, but if I run it with F5, it breaks...
I guess the connection to the external source is not fast enough, but how can I solve such a problem?
lDate = ws.Cells(i, 2).Value
With appIE
.Navigate "www.thisisjusttheexampleweb.com"
'it usually breaks here or below
End With
Do While appIE.Busy
DoEvents
Loop
'or it breaks here on setting allRowOfData
Set allRowOfData = appIE.document.getElementById("Table")
Set InterestingFields = allRowOfData.Children
Thanks for all the help
P.
Edit: I found other solution without parsing any web, what is more it is better to use some API to get the data for free, and very often the web's owners don't allow to parse/scrape any info from the web.
As for the problem you're asking about in your question, the problem is that the browser changes its "Busy" status before the page is completely loaded. This might happen, so the statement:
Do While appIE.Busy
DoEvents
Loop
gets out of the loop before the page is really loaded. What happens then? Well, when you try to set your statement:
Set allRowOfData = appIE.document.getElementById("historicalRateTb1")
...it fails because the element with the ID you are looking for is not in the document of your IE application yet. A "not very nice" way to solve this, but one that should answer your question, usually consists of making the Excel application wait for some time before running the rest of the code, so that the browser has enough time to load the full page:
Application.Wait Now + TimeValue("00:00:05") 'pauses for 5 seconds before the code resumes
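A somewhat less blunt variation on the fixed wait is to poll until the element actually exists, giving up after a timeout. This is a sketch, not part of the original suggestion: the function name and the 10-second default are made up, and Timer resets at midnight, so it is not robust across that boundary.

```vba
' Returns the element once it appears, or Nothing if the timeout expires.
Function WaitForElement(ByVal ie As Object, ByVal id As String, _
                        Optional ByVal timeoutSeconds As Double = 10) As Object
    Dim started As Double
    Dim el As Object

    started = Timer
    Do
        DoEvents
        On Error Resume Next      ' the document may not be available yet
        Set el = ie.document.getElementById(id)
        On Error GoTo 0
        If Not el Is Nothing Then Exit Do
    Loop While Timer - started < timeoutSeconds

    Set WaitForElement = el
End Function
```

In the question's code this would be used as Set allRowOfData = WaitForElement(appIE, "Table"), followed by a check for Nothing before reading .Children.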
However, now that you seem to be a bit more acquainted with IE automation, my suggestion is to start learning to use the so-called XMLHTTP request object. Find more info about it here. I warn you in advance that:
The good aspect of this object is its speed: it requests the data directly over HTTP, which is much faster than automating the browser the way a human would use it to get the same HTML body where your data lies;
The "bad" aspect, apart from the fact that the website might not provide this service, is that the object usually uses the cache to avoid the call. In other words, if you queried the data at 19:52, a query at 19:53 will return the cached data from 19:52 rather than the new data from 19:53. There should be a property on the object to avoid this, or you can just append a random number to the URL as a query parameter (the website will ignore it, but it forces a new request).
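A minimal sketch of the random-number trick is below. The function name FetchFresh and the parameter name "nocache" are made up; any parameter name the site ignores should do.

```vba
Function FetchFresh(ByVal url As String) As String
    Dim req As Object
    Dim sep As String

    Set req = CreateObject("MSXML2.XMLHTTP")

    ' Use "&" if the URL already has a query string, otherwise "?"
    sep = IIf(InStr(url, "?") > 0, "&", "?")

    ' "nocache" is a hypothetical parameter; the random value makes each URL unique
    Randomize
    req.Open "GET", url & sep & "nocache=" & CLng(Rnd * 1000000), False
    req.send

    FetchFresh = req.responseText
End Function
```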
I am a rookie to programming and am pretty sure that I am way over my head, but here is my problem. I work at a manufacturing facility that has a server that I do not have (and cannot have) access to. In order to retrieve and analyze data I have to go onto our Quality Management website which presents multiple drop down menus to select from. The first one is "Manufacturing date", then time, then machine ID, and then run the search. This then opens a new window that contains the data I need to insert into a previously written macro to organize. My buddy from work found this program:
Sub GoogleSearch()
    Set objIE = CreateObject("InternetExplorer.Application")
    WebSite = "http://mant6websrv1.thermatru.com/QMS/QualityReporting/Reports/LDMasterQueryReport.asp"
    With objIE
        .Visible = True
        .navigate WebSite
        Do While .Busy Or .readystate <> 4
            DoEvents
        Loop
        Set Element = .document.getElementsByName("q")
        Element.Item(0).Value = "Hello world"
        .document.forms(0).submit
        '.quit
    End With
End Sub
It was originally a google search program that opens google in explorer and then searches the phrase "hello world". By simply replacing the website with our Quality Management address we have been able to get it to open the correct site, but I have no idea how to get it to select the options I want. I tinkered around a bit with the webdeveloper for explorer and even found the ID names for the boxes I need to fill out...but have no idea how to get the program to select them, and then insert the dates I want and run the search. Any ideas??
p.s. when I say I am a rookie I mean that this is my third day writing macros with vba so please be as simplistic and clear as possible with any answers. I am starting to really enjoy this stuff and want to learn as much as possible. Thanks everyone.
Try using the web macro recorder in Open Twebst. Record the actions you want to take and put the code in Excel's VB editor. In your case you can use VBScript, which can for the most part also be used with VBA. You might not get perfect result immediately with a recording, but it's a good tool to help you get on the right track, even if you may need to tinker with the code a bit and look through the HTML of the site to get perfect control.
Note however that it has its own framework for web automation, it's not recording using the standard Microsoft web libraries for it.
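If you prefer to stay with the plain Internet Explorer object from the question, filling in dropdowns generally comes down to finding each element by its ID and setting its Value. The sketch below is only an outline under several assumptions: the IDs "dateBox", "timeBox" and "machineBox" and all the values are hypothetical placeholders for the ones you found with the developer tools, and the page is assumed to have finished loading.

```vba
' objIE is the InternetExplorer object after the page has finished loading.
Sub SelectOptions(ByVal objIE As Object)
    With objIE.document
        ' For a dropdown (<select>), setting Value picks the option
        ' whose value attribute matches. IDs and values are hypothetical.
        .getElementById("dateBox").Value = "01/15/2014"
        .getElementById("timeBox").Value = "08:00"
        .getElementById("machineBox").Value = "M1"

        ' Same submit call as in the macro from the question
        .forms(0).submit
    End With
End Sub
```

It would be called from inside the macro, after the Do While loop, as SelectOptions objIE.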