Pulling Data from HTML Tables

Pulling Data from HTML Tables - excel

So I understand how to pull data from a single weblink looking at tables. I cannot find not 1 tutorial anywhere on the web about how to do so getting it from Div elements and no one talks about it at all. Can someone please give me an example or something? Either Excel or Google Spreadsheets.
Im trying to teach myself doing so but using this website https://newworldstatus.com/regions/us-east for a small project I want to do.
Thank you in advance.

This is not a comprehensive answer, just intended to show you how some very basic concepts work. Second, an answer for Sheets, but let me preface all of this by saying that while your test URL seems simple enough, you will not be able to do any of this for that specific URL. They are either actively trying to stop scraping or they just have it set up in a way that makes it difficult to scrape by accident. If you directly make a web request to that URL, you will get back the JS code that actually handles the data load-in and not the data itself, so any kind of parsing you try to do will fail because what you see in the page isn't what is actually coming back on the initial page request. All the html that will be in the page is enough to show this:
You would need to either try to read through the code and figure out what they're doing, or do some tinkering in the javascript console, and probably some fairly high-level tinkering. So for a first project, or just to learn some basics, I think I would pick a different test case.
First, in VBA. It's both complicated and not all that complicated at the same time. If you know how web technologies work non-language specifically, then it all works pretty much the same way in VBA. First, you'll need to make a web request. You can do that with the winHTTP library or the msXML library. I usually use winHTTP, but unless what you're doing is complex, either one is fine.
WEB REQUEST:
You'll need to instantiate a request object. You can do that by either adding a reference to the library (tools->references-> and pick the library out of the list) or you can use late binding. I prefer to add the reference, because you get intellisense that way. Here are both:
Dim req As New WinHttp.WinHttpRequest
or
Set req = CreateObject("WinHttp.WinHttpRequest.5.1")
Then you open the request. I'm going to assume this is a straight GET. POST requests get a little more complicated:
req.Open "GET", url, TRUE
If you have the reference added and created the req with Dim, then you'll get the intellisense and as you type that the arguments will pop up and you can use that to refer to the documentation if you have questions. TRUE here is to send it asynchronously, which I would do. If you don't, it will block up the interface. This the Open method, which you can find in the documentation.
https://learn.microsoft.com/en-us/windows/win32/winhttp/iwinhttprequest-interface
Then use
req.send
req.WaitForResponse
source = req.responseText
to send the request. WaitForResponse is needed only if you send the request asynchronously. The last part is to get the responseText into a variable.
PARSING:
Then you'll need to do some stuff with the MSHTML library, so add a reference to that. You can also late bind, but I would not, because it will be very helpful to you to have the prompts in intellisense.
First, set up a document
https://learn.microsoft.com/en-us/dotnet/api/mshtml.htmldocument?view=powershellsdk-1.1.0
and write the source you just fetched to it:
Dim doc as new MSHTML.HTMLdocument
doc.write source
Now you have a document object you can manipulate. The trick is to get a reference to the element you want. There are two methods that will return an element:
getElementById
querySelector
If you are lucky, the element you are looking for will have a unique ID and you can just get it. If not so lucky, you can use a selector that identifies it uniquely. In either case, you will set up an IHTMLElement to return to:
Dim el as MSHTML.IHTMLElement
set el = doc.getElementById("uniqueID") 'whatever the unique ID is
Once you have that, you can use the methods and properties of the element to return information about it:
https://learn.microsoft.com/en-us/dotnet/api/mshtml.ihtmlelement?view=powershellsdk-1.1.0
There are more specific interfaces, like
https://developer.mozilla.org/en-US/docs/Web/API/HTMLAnchorElement
You can use the generic IHTMLElement, but sometimes there are advantages to using a specific element type, for instance, the properties that are available to it.
Sometimes you will have to set up an IHTMLElementCollection:
https://learn.microsoft.com/en-us/previous-versions/windows/internet-explorer/ie-developer/platform-apis/aa703928(v=vs.85)
and iterate it to find the specific element you are looking for. There are four methods that return collections:
getElementsByName
getElementsByTagName
getElementsByClassName
querySelectorAll
getElementsByClassName is sometimes problematic, so forewarned is forearmed.
If you need to do that, set up and IHTMLElementCollection and return the list to that:
dim els as MSHTML.IHTMLElementCollection
set els = doc.getElementsByTagName("tagName") 'for instance a for anchors, div for divs
That is about it. There is obviously more to it, but a comprehensive answer would be very long. This is mostly intended to point you in the right direction and give you more stuff to google.
I will say that you should test out some of these methods in the browser first. They exist in many languages, and all major browsers have developer tools. For Chrome, for instance, press Ctrl+Shift+I to bring up the dev tools, and then in the console window type something like:
document.getElementById("uniqueID")
and you should get the node. or
document.getElementsByClassName(".test") 'where test is the name of the class
document.querySelectorAll("div") ' where you pass a valid CSS selector
and you will get the node list.
It will be quicker to experiment there than to try to set it up and debug in VBA. once you have a good handle on how it works, try to transfer that knowledge to a VBA implementation.
Here is a basic overview of .querySelector to get you started on understanding how those work, although they can get very complicated. In fact, querySelector is my go to method for finding elements.
https://www.w3schools.com/jsref/met_document_queryselector.asp
Now, Google Sheets:
You don't really want to use IMPORTHTML, even though it seems counterintuitive. That function (AFAIK) only supports tables and lists, and it's index based, too, which means you give it a number n and it returns the nth table or list in the page. That means that if they ever change the layout, or the layouts are dynamic in any way, then you won't be able to rely and an index to accurately identify what you want. Also, as you noted people don't really use tables much anymore, and when they say list I'm pretty sure they mean on and elements, which is also not going to be that useful to you. Here's the docs:
https://support.google.com/docs/table/25273?hl=en&visit_id=637732757707317357-1855795725&rd=2
But you can use IMPORTXML. Even though it says XML, you can still use it to parse HTML (for reasons and with limitations that are out of scope for this answer). IMPORTXML takes a URL and an xpath selector. In this way it's similar to the document.querySelector and querySelectorAll methods. Here is some information on xpath in tutorial from from w3schools.
https://www.w3schools.com/xml/xpath_intro.asp
And if you want to test selectors in Chrome you can use $x("selector") in the javascript console in the dev tools. I believe Firefox also supports this, but I am not sure if other browsers do. If not, you can use document.evaluate:
https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate
Even though you can't actually use this in sheets against the URL you've given, let's take a look at a couple of xpath selectors in that context. Hit Ctrl+Shift+I to bring up the dev tools (hopefully you are using Chrome), and then go to the elements tab. If you don't have the javascript console showing in the bottom pane, hit Esc. You should see something like this:
Use the arrow icon in the top left of the dev tools to search the elements, and just click on the first row in the table:
so that you can see the structure of the elements, and figure out how to parse out what you want from it. You'll notice that the cell that's highlighted is contained in a div with a role of "row" and an attribute of row-id. I think that's where I would start. So an xpath to that container would look something like this:
//div[#row-id=1]
where we are fetching all elements (//) that match div and have an attribute (#) of row-id = 1.
If you want to get the children of that container, you just add another level to the path
//div[#row-id=1]/div
where we want to get all children (/) that are divs.
And I notice that they all have a col-id attribute, so if you wanted to fetch the "set" information you'd just specify divs that have an attribute of col-id = 'set':
//div[#row-id=1]/div[#col-id='set']
and to get the text out of that:
//div[#row-id=1]/div[#col-id='set']/text()[1]
since it looks like the second node is the one that has the team name in it. Again, you can see how this WOULD work in the dev tools, but you won't actually be able to use this for your URL.
I'm not going to spend a lot of time here. As already stated, you won't be able to use this method on your specific URL. If you can figure out the actual URL that your URL wraps around, then perhaps. Also, since there's only one argument, the selector, then there's not much more to expound on. If you needed something more complex, like the ability to iterate over a set of matching nodes, you could probably do it in Scripts, but I would probably just switch to Excel if it started getting that complicated. The only exception would be if the data was JSON formatted, in which case Scripts will be able to handle that better than VBA, although I would probably switch to a different language entirely in that case.
Since your URL is probably not good for testing, I'm going to point you to this tutorial from Geckoboard, which has a few different examples from sites like Wikipedia and Pinterest.
https://www.geckoboard.com/blog/use-google-sheets-importxml-function-to-display-data/
So google around, experiment, and let me know if you need any help. And this was all off the top of my head, so let me know if any of this stuff throws errors so I can edit the answer.
Also, be aware that Excel is not always the right tool for dealing with this. Very often, while the page might have the elements you are looking for, they will be loaded in with JSON and both php and javascript can natively handle JSON objects, while VBA doesn't. If the data is JSON formatted, it is much easier to parse it out of that than trying to parse it out of the DOM structure (DOM = document object model, another thing to google). Also, in many cases, if the data is loaded in with AJAX, it won't be returned with your winHTTP call, because that doesn't execute any javascript that might be in the page.
Further, in many cases you will need to set headers or cookies in the winHTTP call to get the data (calls without the right setings might return an error or a redirect). That is also not addressed in my answer, although you can set headers and cookies in winHTTP. You would need to sniff the calls, either with Fiddler or similar or with the network tab in dev tools, to find out the right combination of information to pass with your request.

Related

what are disadvantages of using ":xpath" attribute to identify an object using Watir?

I am using :xpath attribute frequently to identify an element for my automation scripts using Watir and found it really amazing. It is least changing attribute so less work to maintain automated scripts.. off course for those elements which can't be identified otherwise easily through :id, :name, :value attributes..
I am bit concerned to take some expert advise before building so many automated scripts using :xpath.
What is disadvantage of using :xpath to identify an object using Watir?
Do :xpath value of an element will be same in IE, Chrome and FF?when
Is there anything else important i should be aware about using :xpath?
Thanks

The xpath should always be the same in all browsers.
The problem with using xpath is that it is the easiest locator to break, as the locator for the element is dependant on nothing else in that xpath changing. e.g. if you are locating a results table on a page using an xpath and at a later date another table gets added above the table, then the xpath will be broken and your tests will fail until you update the xpath. If that table was located using an id then adding the second table wouldn't break anything as the new table would have a different id.
If the pages you're working on don't have id's and it isn't an option to add some/ ask for some to be added then remember that in watir you can use multiple locators.
e.g. #browser.table(class: 'results_table', text: /Original results table/)
This is a silly example but hopefully it illustrates the point. If there are cases when using multiple locators still won't work for any reason, then I would look into using css selectors instead of xpath as you should be able to achieve the same things but it will be less brittle.
The issue of how often tests break isn't too important in a small test suite, especially if you're the only one working on the tests. However, a couple of years from now when you have hundreds of tests to maintain and two or three people sharing the codebase you can end up spending longer fixing old tests that you spend writing the new ones. It's worth doing anything you reasonably can to minimise this as you go along as doing a rewrite later will always take longer.
Hopefully some of this helps!

VBA MSXML2 Open member POST

In this code snipet example...
Dim pullSite As String
Dim pullXMLHTTP As MSXML2.XMLHTTP
Set pullXMLHTTP = New MSXML2.XMLHTTP
pullSite = "http://www.ThisIsASite.com/Documents/whatever........xml"
pullXMLHTTP.Open "POST", strXMLSite, False
...I am confronted with the word Open in the code which I reckon is a method since it's not being equaled to anything (in the OB screen at the bottom left once picked it says it's a Sub, but I rest my case on OB naming methods as Sub and Function- I guess it wants to say this is a class browser so we state the method as Sub or Function just like when we "custom" make them in a class module but I am not sure of that). If I go to the Object Browser in the MSXML2 Library I will find in XMLHTTP the Open listed as a Member.
Well if I hit F1 VBA's help has 3 choices. First choice (VBA Library) one tells me it is the Open Statement which clearly isn't.
The Second choice (Excel Library) has an RecentFile.Open Method without parameters so it doesn't buy that either.
The Third choice (Office Library) gives me the Developer Reference main menu.
So F1 VBA's Auto help has nothing for the MSXML2 Library. However in the OB's lower left bottom if I choose from the MSXML2.XMLHTTP Library the Open Member it states
Sub open(bstrMethod As String, bstrUrl As String, [varAsync], [bstrUser], [bstrPassword])
Member of MSXML2.XMLHTTP
Open HTTP connection
So my question is twofold
Is there a solid help which documents the members of the MSXML2 Library in detail/or a way to add that on the VBA' s Autohelp? (BTW Microsoft's Development Center on MSXML MSXLM is vast and to vague. I also believe that it points to experienced innuendos...
Why do I see in relevant examples where the MSXML2.HTTP Open member is used, the first String parameter always populated with the word "POST" in Cap letters? It's mind boggling to me-can't they use another String value?
After #codeape 's submitted answer:
Dear codeape so according to you answer, we have the
open Method (IXMLHTTPRequest)
and the parameters in an existing VBA example (which MS lacks) all-together form the IXMLHTTPRequest so:
Open methods Paramenters --> IXMLHTTPRequest
Please confirm if this is true so that I will edit this post in a more user-friendly all confirmed instructions for others to use.
Also I can't help my curiosity why do we have an I in the IXMLHTTPRequest naming? Just wondering...
In regards to the POST Question thanks to your answer I googled HTTP Method and found this neat link in W3Schools which to quite an extent explains to me. Really why not use GET? It looks more fast-forward... instead of submitting data to be processed by a specified source (with POST) just request the data from the specified source in one stroke it looks more swift and simple...
Oooups I almost forgot in the open Method link you gave me as well as in the bottom of OB we see bstrMethod As String, bstrUrl As String etc. Could you tell me what that bstr means? - Is the BSTR means Basic String or Binary String? please confirm

The documentation for XMLHTTP is here. Note that the documentation page is for the interface IXMLHTTPRequest. MSXML.XMLHTTP implements this interface.
The Open method's first argument is the HTTP method, i.e. GET or POST (or PUT, DELETE, etc.).
Sadly it looks as if MS has forgotten about VBA when it comes to documenting various COM libraries. Most/all documentation is for C++ and javascript.
Comments on other things mentioned in the question:
The "I" in IXMLHTTPRequest is a naming convention in COM used for naming interfaces.
Regarding HTTP methods (GET, POST etc.):
The server-side code decides how to interpret and respond to the HTTP requests.
When you type a URL into you browser's URL bar, the browser sends a GET request to the server. So if you want to fetch a web page using MSXML.XMLHTTP you do: obj.Open "GET", "http://www.google.com"
When you submit a HTML form from your browser, the browser can either send a GET or a POST request depending on method attribute of the <form> tag. <form method="GET"> (the default) or <form method="POST">.
Javascript on the page can send AJAX requests using XMLHTTPRequest with any HTTP method.
Regarding argument names bstrMethod etc.: This is hungarian notation, where an argument's expected type is reflected in the argument's name to make things easier for the developer. The bstr means that this argument is a COM "basic string". A basic string is what COM typically uses for unicode strings. In VBA, you can use normal VBA strings where a method expects a bstr.

Any way in Expression Engine to simulate Wordpress' shortcode functionality?

I'm relatively new to Expression Engine, and as I'm learning it I am seeing some stuff missing that WordPress has had for a while. A big one for me is shortcodes, since I will use these to allow CMS users to place more complex content in place with their other content.
I'm not seeing any real equivalent to this in EE, apart from a forthcoming plugin that's in private beta.
As an initial test I'm attempting to fake shortcodes by using delimited strings (e.g. #foo#) in the content field, then using a regex to pull those out and pass them to a function that can retrieve the content out of EE's database.
This brings me to a second question, which is that in looking at EE's API docs, there doesn't appear to be a simple means of retrieving the channel entries programmatically (thinking of something akin to WP's built-in get_posts function).
So my questions are:
a) Can this be done?
b) If so, is my method of approaching it reasonable? Or is there something stupidly obvious I'm missing in my approach?
To reiterate, my main objective here is to have some means of allowing people managing content to drop a code in place in their content that will be replaced with channel content.
Thanks for any advice or help you can give me.

Here's a simple example of the functionality you're looking for.
1) Start by installing Low Replace.
2) Create two Global Variables called gv_hello and gv_goodbye with the values "Hello" and "Goodbye" respectively.
3) Put this text into the body of an entry:
[say_hello]
Nice to see you.
[say_goodbye]
4) Put this into your template, wrapping the Low Replace tag around your body field.
{exp:low_replace
find="[say_hello]|[say_goodbye]"
replace="{gv_hello}|{gv_goodbye}"
multiple="yes"
}
{body}
{/exp:low_replace}
5) It should output this into your browser:
Hello
Nice to see you.
Goodbye
Obviously, this is a really simple example. You can put full blown HTML into your global variable. For example, we've used that to render a complex, interactive graphic that isn't editable but can be easily dropped into a page by any editor.
Unfortunately, due to parse order issues, EE tags won't work inside Global Variables. If you need EE tags in your short code output, you'll need to use Low Variables addon instead of Global Variables.

Continued from the comment:
Do you have examples of the kind of shortcodes you want to support/include? Because i have doubts if controlling the page-layout from a text-field or wysiwyg-field is the way to go.
If you want editors to be able to adjust layout or show/hide extra parts on the page, giving them access to some extra fields in the channel, is (imo) much more manageable and future-proof. For instance some selectfields, a relationship (or playa) field, or a matrix, to let them choose which parts to include/exclude on a page, or which entry from another channel to pull content from.
As said in the comment: i totally understand if you want to replace some #foo# tags with images or data from another field (see other answers: nsm-transplant, low_replace). But, giving an editor access to shortcodes and picking them out, is like writing a template-engine to generate ee-template code for the ee-template-engine.
Using some custom fields to let editors pick and choose parts to embed is, i think, much more manageable.
That being said, you could make a plugin to parse the shortcodes from a textareas content, and then program a lot, to fetch data from other modules you want to support. For channel entries you could build out of the channel data library by objectiveHTML. https://github.com/objectivehtml/Channel-Data

I hear you, I too miss shortcodes from WP -- though the reason they work so easily there is the ubiquity of the_content(). With the great flexibility of EE comes fewer blanket solutions.
I'd suggest looking at NSM Transplant. It should fit the bill for you.

There is also a plugin called Shortcode, which you can find here at
Devot-ee
A quote from the page:
Shortcode aims to allow for more dynamic use of content by authors and
editors, allowing for injection of reusable bits of content or even
whole pieces of functionality into any field in EE

Client id of elements changes in sharepoint page? Why? When?

Client id of every element from the sharepoint page changes sometimes.
Can anybody please tell me why and on which instance it changes???

jQuery is fantastic! It makes client-side development faster and
countless plug-ins are available for just about every need. Using
jQuery with Asp.NET Web-Forms gets aggravating when dealing with
nested server controls. ClientID’s get appended when using ASP.NET
Master Pages. Objects in JavaScript tend to look like this:
ctl00_m_g_aaf13d41_fc78_40be_81d5_2f40e534844f_txtName
The difficulty of the issue above is that, in order to get the element txtName, It’s necessary to know the full “path”. It’s quite
aggravating to refer to get the object using the method below:
document.getElementByID('ctl00_m_g_aaf13d41_fc78_40be_81d5_2f40e534844f_txtName');
This becomes a big problem when developing server controls or web parts that may be used in a typical ASP.NET application or SharePoint.
You cannot hard-code the path above if you don’t know the full path of
the control.
Fortunately, there are a few ways to get around this. There are three, in particular, I will mention. The first is the jQuery
equivalent to the standard JavaScript method:
document.getElementById("<%=txtName.ClinetID%>");");
This can be done in jQuery by using:
$("#'<%=txtName.ClinetID%>");");
The second jQuery method does not require server tags. This method searches through all tags and looks for an element ending with the
specified text. The jQuery code for this method is shown below:
$("[id$='_txtName']");
There are, of course, drawbacks to both methods above. The first is fast, but requires server tags. It’s fast, but it just looks messy.
Also, it will not work with external script files. The second
alternative is clean, but it can be slow. As I said earlier, there are
several other alternatives, but these two are the ones I find myself
using the most.
The third registering Javascript in C# code behind.
Page.ClientScript.RegisterStartupScript(GetType(), "saveScript",
String.Format("function EnableSave( isDisabled )"+
"{{ var saveButton = document.getElementById(\"{0}\");"+
"saveButton.disabled=isDisabled;}}", btnSave.ClientID), true);
Do not forget to call this script after controls have been loaded, I mean after Controls.Add(); in CreateChildControls method while
developing webparts.

$('input[title="Name"]')
Look at the page source and get the value of the title property - works every time.

ListBox1.Attributes.Add("onmouseup",
"document.getElementById('" + base.ClientID + "_" + lbnFilter.ClientID + "').style.display='block';");

How to add text to any html element?

I want to add text to body element but I don't know how. Which method will work on the body tag?
Sorry for my english and thanks for replies.

In Watir, you can manipulate a web page (DOM) using JS, just like that:
browser.execute_script("document.getElementById('pageContent').appendChild(document.createTextNode('Great Success!'));")
I assume that the point of the question is:
All users are not just interacting by just clicking buttons and links on the web app, some of them are doing nasty things like altering http requests to make your system do something that it is not supposed to do... or to just have some fun.
To mimic this behavior, you could write a ui-test that alters forms on the web page, so that for example, one could type in anything into any field instead of a limited dropdown.
To do that, ui test has to:
manipulate DOM to set form inputs free of limitations (replace select's with input's, etc.)
ui test has to know, which values to use, in many cases it's pointless to enter random values. Your webapp has to provide some good "unwanted" options.

Why would you want to modify the webpage in Watir? It's for automated testing, not DOM manipulation.
If you want to add something to the DOM element in javascript, you can do it like that:
var txt = document.createTextNode(" This text was added to the DIV.");
document.getElementById('myDiv').appendChild(txt);
Or use some DOM manipulation library, like jQuery.

If you have not worked your way though the watir tutorial, I would suggest you do so. It deals with things like filling in text fields etc.
Learn to use the developer tools for your browser, Firebug for Firefox, or the built in tools for IE and CHrome. They will let you look at things as you interact with the site.
If the element is not a normal HTML input field of some sort, then you are dealing with a custom control. Many exist and they are varied and there is no one set solution for dealing with them. Without knowing which control you are using, and being able ourselves to interact with a sample of it, or at least see the HTML, it is very very difficult to advise you, we basically have to just guess (which is often a waste of everyone's time)
Odds are if you have a place you can enter text, then it is some form of input control, it might not start out that way, you may need to click on some other element, to make the input area appear, but without a sample of HTML all we can do is guess.
If this is a commercial control, see if you can find a demo site that shows the control in action. Try googling things like class names for the elements and often you get lucky

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string