Easy way to get hyperlink info from rendered web page

I'd like to do this programmatically:
Given a page URL, I need to get all links on the page. At least three pieces of information must be obtained for each link: the anchor text, the href attribute value, and the absolute position of the link on the page.
The Java CSSBox library is an option, but it's not fully implemented yet (the href attribute value cannot be obtained at the same time, so some extra mapping must be done with an additional library such as Jsoup). What's more, CSSBox renders a page really slowly.
It seems that JavaScript has all the necessary functions available, but we would have to inject the JavaScript code into the page and write a driver to take advantage of existing browsers. Scripting languages such as Python and Ruby support this as well. I'm finding it hard to work out which tool is the most convenient.
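For reference, the JavaScript that would be injected into the rendered page might look roughly like the sketch below. It only uses standard DOM calls (querySelectorAll, getBoundingClientRect, the scroll offsets); the function name collectLinks and the shape of the returned objects are assumptions made for illustration, and how the script gets injected depends on whichever browser driver is chosen.

```javascript
// Collects anchor text, href, and absolute page position for every link.
// `doc` and `win` are passed in (instead of using the globals directly)
// only so that the function can also be exercised outside a browser.
function collectLinks(doc, win) {
  const links = [];
  for (const anchor of doc.querySelectorAll("a[href]")) {
    // getBoundingClientRect() is relative to the viewport, so the current
    // scroll offset is added to get a position relative to the whole page.
    const rect = anchor.getBoundingClientRect();
    links.push({
      text: anchor.textContent.trim(),
      href: anchor.href, // the property form is the resolved, absolute URL
      x: rect.left + win.scrollX,
      y: rect.top + win.scrollY,
      width: rect.width,
      height: rect.height,
    });
  }
  return links;
}
```

Inside the page this would be called as collectLinks(document, window), and the result could be handed back to the driver as JSON.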

Does PHP's DOM manipulation library help you? http://www.php.net/manual/en/book.dom.php

Related

In Chrome extensions, why use a background page with HTML?

I understand that the background page of a Chrome extension is never displayed. It makes sense to me that a background page should contain only scripts. In what situations would HTML markup ever be needed?
At https://developer.chrome.com/extensions/background_pages there is an example with an HTML background page, but I haven't been able to get it to work (perhaps because I am not sure what it should be doing).
Are there any examples of simple Chrome extensions which demonstrate how HTML markup can be useful in a background page?

Historical reasons
The background page is, technically, a whole separate document - except it's not rendered in an actual tab.
For simplicity's sake, perhaps, extensions started with requiring a full HTML page for the background page through the background_page manifest property. That was the only form.
But, as evidenced by your question, most of the time it's not clear what the page can actually be used for except for holding scripts. That made the whole thing just a piece of boilerplate.
That's why, when Chrome introduced "manifest_version": 2 in 2012 as a big facelift to extensions, they added an alternative format, the background.scripts array. This offloads the boilerplate to Chrome, which then creates a background page document for you, succinctly called _generated_background_page.html.
Today, this is the preferred method, though background.page is still available.
Practical reasons
With all the above said, you still sometimes want to have actual elements in your background page's document.
<script> for dynamically adding scripts to the background page (as long as they conform to extension CSP).
Among other things, since you can't include external scripts through background.scripts array, you need to create a <script> element for those you whitelist for the purpose.
<canvas> for preparing image data for use elsewhere, for example in Browser Action icons.
<audio> for producing sounds.
<textarea> for (old-school) working with clipboard (don't actually do this).
<iframe> for embedding an external page into the background page, which can sometimes help extracting dynamic data.
..possibly more.
It's debatable which boilerplate is "better": creating the elements in advance as a document, or using document.createElement and its friends as needed.
In any case, a background page is always a page, whether provided by you or autogenerated by Chrome. You can use all the DOM functions you want.
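As an illustration of the document.createElement route for the <script> case mentioned above, here is a minimal sketch. The function name addBackgroundScript and the example URL are hypothetical, and the script's origin is assumed to be already allowed by the extension's CSP.

```javascript
// Appends a <script> element to the background page's document.
// `doc` stands in for the background page's `document`; `url` is a
// hypothetical script location that the extension CSP must already allow.
function addBackgroundScript(doc, url) {
  return new Promise((resolve, reject) => {
    const script = doc.createElement("script");
    script.src = url;
    script.onload = () => resolve(script);
    script.onerror = () => reject(new Error("Failed to load " + url));
    doc.head.appendChild(script);
  });
}
```

In a background script this would be called as addBackgroundScript(document, "https://example.com/lib.js"). It works the same way whether the page was written by hand or autogenerated by Chrome.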

My two cents:
Take Google Mail Checker as an example: it declares a canvas in background.html:
<canvas id="canvas" width="19" height="19">
Then it could manipulate the canvas in background.js and call chrome.browserAction.setIcon({imageData: canvasContext.getImageData(...)}) to change the browser action icon.
I know we could create the canvas dynamically in background.js; however, when doing something that involves DOM elements, using HTML directly seems easier.
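A rough sketch of what the background.js side could look like, assuming the canvas above is present in the page. The function name setBadgeIcon and the drawing itself (a red square with a number on it) are made up for illustration and are not what Google Mail Checker actually draws; only getElementById, the 2D canvas calls and chrome.browserAction.setIcon are real APIs.

```javascript
// Draws a count onto the canvas declared in background.html and uses the
// resulting pixels as the browser action icon.
// `doc` and `chromeApi` stand in for the `document` and `chrome` globals.
function setBadgeIcon(doc, chromeApi, unreadCount) {
  const canvas = doc.getElementById("canvas");
  const ctx = canvas.getContext("2d");
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  // Hypothetical artwork: a red background with the count in white.
  ctx.fillStyle = "#d00";
  ctx.fillRect(0, 0, canvas.width, canvas.height);
  ctx.fillStyle = "#fff";
  ctx.font = "bold 11px sans-serif";
  ctx.textAlign = "center";
  ctx.textBaseline = "middle";
  ctx.fillText(String(unreadCount), canvas.width / 2, canvas.height / 2);
  chromeApi.browserAction.setIcon({
    imageData: ctx.getImageData(0, 0, canvas.width, canvas.height),
  });
}
```

It would be called as setBadgeIcon(document, chrome, 3) whenever the count changes.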

Why does React.js' API warn against inserting raw HTML?

From the tutorial
But there's a problem! Our rendered comments look like this in the
browser: "<p>This is <em>another</em> comment</p>". We want those tags
to actually render as HTML.
That's React protecting you from an XSS attack. There's a way to get
around it but the framework warns you not to use it:
...
<span dangerouslySetInnerHTML={{__html: rawMarkup}} />
This is a special API that intentionally makes it difficult to insert raw HTML, but for Showdown we'll take advantage of this backdoor.
Remember: by using this feature you're relying on Showdown to be secure.
So there exists an API for inserting raw HTML, but the method name and the docs all warn against it. Is it safe to use this? For example, I have a chat app that takes Markdown comments and converts them to HTML strings. The HTML snippets are generated on the server by a Markdown converter. I trust the converter, but I'm not sure if there's any way for a user to carefully craft Markdown to exploit XSS. Is there anything else I should be doing to make sure this is safe?

Most Markdown processors (and I believe Showdown as well) allow the writer to use inline HTML. For example a user might enter:
This is _markdown_ with a <marquee>ghost from the past</marquee>. Or even **worse**:
<script>
alert("spam");
</script>
As such, you should have a whitelist of tags and strip all the other tags after converting from markdown to html. Only then use the aptly named dangerouslySetInnerHTML.
Note that this is also what Stack Overflow does. The above Markdown renders as follows (without you getting an alert thrown in your face):
This is markdown with a ghost from the past. Or
even worse:
alert("spam");
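The whitelist step could be sketched roughly as follows. This is a deliberately crude illustration, not a vetted sanitizer: the tag list is hypothetical, every attribute is dropped (so links and images would not survive), and any "<" or ">" that is not part of a recognised tag is escaped so that removing one tag cannot splice a new one together. For anything user-facing, a maintained sanitizer library is likely the safer choice.

```javascript
// Hypothetical whitelist; adjust to the tags your Markdown converter emits.
const ALLOWED_TAGS = new Set([
  "p", "em", "strong", "code", "pre", "ul", "ol", "li", "blockquote", "br",
]);

// Matches an opening or closing tag; captures the slash and the tag name.
const TAG_PATTERN = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;

function escapeAngleBrackets(text) {
  return text.replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Keeps whitelisted tags (with all attributes removed), drops every other
// tag, and escapes stray angle brackets in the text between tags.
function stripDisallowedTags(html) {
  let result = "";
  let lastIndex = 0;
  for (const match of html.matchAll(TAG_PATTERN)) {
    result += escapeAngleBrackets(html.slice(lastIndex, match.index));
    const [whole, slash, name] = match;
    const tag = name.toLowerCase();
    if (ALLOWED_TAGS.has(tag)) {
      result += "<" + slash + tag + ">";
    }
    lastIndex = match.index + whole.length;
  }
  result += escapeAngleBrackets(html.slice(lastIndex));
  return result;
}
```

The output of stripDisallowedTags(convertedHtml), where convertedHtml is whatever the Markdown converter produced, is what would then be passed to dangerouslySetInnerHTML.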

There are three reasons it's best to avoid html:
security risks (xss, etc)
performance
event listeners
The security risks are largely mitigated by markdown, but you still have to decide what you consider valid, and ensure everything else is disallowed (maybe you don't allow images, for example).
The performance issue is only relevant when something in the markup will change. For example, suppose you generated html with this: "Time: <b>" + new Date() + "</b>". React would normally decide to update only the textContent of the <b/> element, but with raw html it instead replaces everything, and the browser must reparse the html. In larger chunks of html, this is more of a problem.
If you did want to know when someone clicks a link in the results, you've lost the ability to do so simply. You'd need to add an onClick listener to the closest react node, and figure out which element was clicked, delegating actions from there.
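A minimal sketch of that delegation, assuming the handler is attached as onClick on the React node that carries dangerouslySetInnerHTML. The function name findClickedLink is hypothetical; it relies only on the standard target, currentTarget, parentNode and tagName properties.

```javascript
// Given a click event caught on a container, returns the nearest enclosing
// <a> element of the click target, or null if the click was not on a link.
function findClickedLink(event) {
  let node = event.target;
  // Walk up from the clicked element, stopping at the container itself.
  while (node && node !== event.currentTarget) {
    if (node.tagName === "A") return node;
    node = node.parentNode;
  }
  return null;
}
```

The walk is needed because the click may land on an element nested inside the link (an <em> within an <a>, say) rather than on the link itself.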
If you would like to use Markdown in React, I recommend a pure react renderer, e.g. vjeux/markdown-react.

Why can the background page be an html file?

In manifest.json, we specify our background page and can give either an html or a js file for it. Since it is only a script that executes, what sense does it make to have an html file for it?
I mean where is UI going to get shown anyway?
Similarly the devtools_page property has to be an html file. What sense does that make?

It will not be shown anywhere (that's the essence of "background"), but some elements on it make sense.
You can have an <audio> tag, and if you play it, it will be heard.
You can have an <iframe> with some other page loaded invisibly.
..and so on
As for devtools_page, it would actually be visible in the interface (as an extra panel in the DevTools).
It is possible that devtools_page must be an HTML file just for legacy reasons: it was not updated when manifest version 2 rolled out with changes to how background pages are specified. Still, the same arguments as above apply.

background_page is a legacy feature from the initial support of extensions in Chrome. background.scripts was added in Chrome 18. I can't speak for Google's original intentions, but I'd guess that in the original design using a page felt more natural and would be less likely to confuse developers. Once they realized how many background_pages were just being used to load JavaScript, it made sense to explicitly support that.

How to detect page language/locale in a Chrome extension content script?

I would like my Chrome extension content script to detect the language or locale of the page's content (not the browser language/locale). I assume there is a method for this in the Chrome extension API, but should I be using standard JavaScript libraries instead?

This is the Chrome extension method: chrome.tabs.detectLanguage(...). From the description:
Detects the primary language of the content in a tab.
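A small sketch of calling it, wrapped in a Promise for convenience. The wrapper name detectTabLanguage is hypothetical. Note that, as far as I know, chrome.tabs is not exposed to content scripts, so this would have to run in the background page, with the content script asking for the result via message passing.

```javascript
// Wraps the callback-style chrome.tabs.detectLanguage in a Promise.
// `chromeApi` stands in for the `chrome` global; `tabId` identifies the tab.
function detectTabLanguage(chromeApi, tabId) {
  return new Promise((resolve, reject) => {
    chromeApi.tabs.detectLanguage(tabId, (language) => {
      const error = chromeApi.runtime.lastError;
      if (error) {
        reject(new Error(error.message));
      } else {
        // An ISO language code such as "en", or "und" if it is unknown.
        resolve(language);
      }
    });
  });
}
```

In the background page it would be used as detectTabLanguage(chrome, tabId).then(function (lang) { ... }).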

You could use standard JavaScript DOM functions to look for a lang attribute on the root html element (or possibly the body element). But keep in mind that a page might not be entirely in one language, so different elements of the page may be marked up with different lang attributes.
Also, if you want to support XHTML, I'd suggest looking at the xml:lang attribute as well.
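A minimal sketch of that lookup, usable directly from a content script. The function name declaredLanguage is hypothetical; it only reports what the page declares on html or body and makes no attempt to handle per-element lang attributes further down the tree.

```javascript
// Returns the language declared on the root element (or, failing that, on
// <body>), or null if neither declares one. `doc` stands in for `document`.
function declaredLanguage(doc) {
  const candidates = [doc.documentElement, doc.body];
  for (const element of candidates) {
    if (!element) continue;
    // Prefer lang; fall back to xml:lang for XHTML documents.
    const lang = element.getAttribute("lang") || element.getAttribute("xml:lang");
    if (lang) return lang.trim();
  }
  return null;
}
```

Called as declaredLanguage(document), this returns whatever the author wrote (for example "en-GB"), which may be missing or wrong, so it is a hint, not a detection result.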

Inserting a news-feed widget to a page

I have a page I'd like to embed a news-feed widget into (so that the feed from some remote site will be displayed in my site).
While there are quite a few free news-feed widgets available out there (a partial list is here: http://allwebco-templates.com/support/S_script_newsfeed.htm), they all require inserting complex code into the html page, with all the parameters hard-coded into the generated code, which looks something like this:
insertedWidgetText = "<script id=\"scrnewsblock10795953\" type=\"text/javascript\">...script specific parameters go here...</script>"
let feedWidget = toWidgetBody [hamlet|#{preEscapedText insertedWidgetText}|]
This doesn't integrate well with Yesod's approach as it requires specifying to Hamlet that the content is preEscapedText, which in turn disables the ability to use Hamlet's processing to alter parameters of the widget dynamically (So in case I want the widget to use a different source, for example, I need to statically change the quoted text and cannot use Hamlet's variable substitution).
Of course I could do some text manipulation myself, tailor built for the widget I'm using, but that doesn't seem like the "right" solution (especially if I want to have the embedded text in some external file and not in the middle of my code as in the example above).
Can the above mentioned issue have a better solution than the one I thought about?
Is there an implementation of a news-feed widget in Haskell/Yesod that I can use as a plugin?
Note: I'm a very poor JavaScript programmer, but solutions in that direction are also welcome.
Thanks,
