Using jquery for parsing causes image network traffic in Chrome extension? - google-chrome-extension

I'm writing an extension that scrapes web pages using jQuery. After a while I start getting network errors saying resources are not available, along with console errors about loading images from the pages I'm scraping. I thought $.get() might be loading the response as HTML somehow, but it still happens when I use a raw XMLHttpRequest, and it even happens when I call $(text) with static text.
Looking at the Application tab in the dev tools of my background page, I can see images listed even though they don't exist in the page's HTML. For example, run this in the console of any extension background page that has jQuery loaded:
$('<div>Hello, world!<img src="https://www.gravatar.com/avatar/fdc806d0a8834e57b2d9309849dea8cd"/></div>')
You can see that the image was loaded on the Application tab in dev tools, and it is visible on the Network tab, even though it isn't in the HTML of the page when inspected.
I assume that jQuery is creating DOM elements to use the browser's capabilities for finding elements, and that Chrome is happily pre-fetching that image even though the element isn't on the page and the page will never be visible anyway. Besides the extra network traffic, it is causing me errors.
I've tried disabling 'precache' in chrome://flags but that didn't work. For now I'm replacing <img with <noimg, which seems to work but is not ideal:
$(text.replace(/<img /g, '<noimg '))
Is there a way to keep this from happening? Is there another library besides jQuery (like cheerio in Node) that wouldn't actually create DOM objects?

Use the built-in DOMParser to parse the HTML into a detached document, which is inert (its images should not be fetched and its scripts do not run), then use jQuery on that document object:
var doc = new DOMParser().parseFromString(yourHTMLstring, 'text/html');
$('.some.selector', doc).attr('foo', 'bar');
In case there may be relative links in the HTML, add a base element explicitly:
$(doc.head).append('<base href="' + realFullURL + '">')
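A minimal sketch of the two steps above wrapped in one helper. The name parseDetached and its parameters are hypothetical, not part of jQuery or the DOM API. It builds the base element with createElement rather than string concatenation, so a quote character in the URL cannot break the markup. The parser argument exists only so the function can be exercised outside a browser; in an extension page you would omit it.

```javascript
// Hypothetical helper: parse HTML into an inert document and pin its base URL.
// `parser` defaults to the browser's DOMParser; it is a parameter only so the
// function can be exercised with a stand-in outside a browser.
function parseDetached(html, baseUrl, parser = new DOMParser()) {
  const doc = parser.parseFromString(html, 'text/html');
  if (baseUrl) {
    const base = doc.createElement('base');
    base.href = baseUrl;            // set as a property, so no manual escaping is needed
    doc.head.appendChild(base);     // appended to <head>, as in the answer above
  }
  return doc;
}

// Usage in an extension page with jQuery loaded (URL is a placeholder):
// const doc = parseDetached(text, 'https://example.com/some/page');
// $('img', doc).each(function () { console.log(this.getAttribute('src')); });
```

Passing the document as jQuery's second argument keeps the selector scoped to the parsed document instead of the live page.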

Related

In Chrome extensions, why use a background page with HTML?

I understand that the background page of a Chrome extension is never displayed. It makes sense to me that a background page should contain only scripts. In what situations would HTML markup ever be needed?
At https://developer.chrome.com/extensions/background_pages there is an example with an HTML background page, but I haven't been able to get it to work (perhaps because I am not sure what it should be doing).
Are there any examples of simple Chrome extensions which demonstrate how HTML markup can be useful in a background page?
Historical reasons
The background page is, technically, a whole separate document - except it's not rendered in an actual tab.
For simplicity's sake, perhaps, extensions started with requiring a full HTML page for the background page through the background_page manifest property. That was the only form.
But, as evidenced by your question, most of the time it's not clear what the page can actually be used for except for holding scripts. That made the entire page just a piece of boilerplate.
That's why, when Chrome introduced "manifest_version": 2 in 2012 as a big facelift to extensions, they added an alternative format: the background.scripts array. This offloads the boilerplate to Chrome, which then creates a background page document for you, succinctly called _generated_background_page.html.
Today, this is the preferred method, though background.page is still available.
Practical reasons
With all the above said, you still sometimes want to have actual elements in your background page's document.
<script> for dynamically adding scripts to the background page (as long as they conform to extension CSP).
Among other things, since you can't include external scripts through background.scripts array, you need to create a <script> element for those you whitelist for the purpose.
<canvas> for preparing image data for use elsewhere, for example in Browser Action icons.
<audio> for producing sounds.
<textarea> for (old-school) working with clipboard (don't actually do this).
<iframe> for embedding an external page into the background page, which can sometimes help extracting dynamic data.
..possibly more.
It's debatable which boilerplate is "better": creating the elements in advance as a document, or using document.createElement and its friends as needed.
In any case, a background page is always a page, whether provided by you or autogenerated by Chrome. You can use all the DOM functions you want.
My two cents:
Take Google Mail Checker as an example. It declares a canvas in background.html:
<canvas id="canvas" width="19" height="19">
Then it could manipulate the canvas in background.js and call chrome.browserAction.setIcon({imageData: canvasContext.getImageData(...)}) to change the browser action icon.
I know we could create the canvas dynamically in background.js. However, when doing something that involves a DOM element, using HTML directly seems easier.
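For illustration, the canvas-to-icon flow could be sketched as below. The name renderIcon and its drawing callback are hypothetical; chrome.browserAction.setIcon is the call named in the answer, and the element id comes from the markup quoted above.

```javascript
// Hypothetical helper: draw onto a canvas and return pixel data for setIcon.
function renderIcon(canvas, draw) {
  const ctx = canvas.getContext('2d');
  ctx.clearRect(0, 0, canvas.width, canvas.height);   // start from a blank icon
  draw(ctx, canvas.width, canvas.height);              // caller paints the icon
  return ctx.getImageData(0, 0, canvas.width, canvas.height);
}

// Usage in background.js, assuming the <canvas id="canvas"> element above:
// const imageData = renderIcon(document.getElementById('canvas'), (ctx, w, h) => {
//   ctx.fillStyle = '#d00';
//   ctx.fillRect(0, 0, w, h);
// });
// chrome.browserAction.setIcon({ imageData: imageData });
```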

Mobile Safari fails to scroll to named anchor

I have a big SVG document here, containing a map of all the quests in a certain online game. Each quest node is inside an SVG <a> element, linking to a distinct named anchor in a big HTML document that loads in another tab, containing further details about that particular quest. This works exactly as desired in desktop Safari, and I'd expect it to work just as well in any browser that supports SVG at all, since I'm using only the most basic form of linking. But it fails badly on Mobile Safari (iOS 6), which is my single most important browser target, considering that the game in question is for the iPad. It only scrolls to the correct anchor on the initial load of the HTML page; clicking a different quest in the SVG tab causes a switch to the HTML tab, and the hash (fragment ID) in the address bar changes, but the page doesn't auto-scroll.
This appears to be a known limitation in Mobile Safari: hash-only changes in the URL apparently used to force a page reload, and that got over-fixed such that nothing gets triggered at all now. The fixes I've found online all seem to be applicable only in cases where the URL change is being generated programmatically, from within the same document, rather than by static links from a different document.
Further details:
I've tried doing the named anchors in both the old <a name="..."> form, and the newer <h1 id="..."> form. No difference.
I've tried adding an onhashchange handler, to force the scrolling to take place, but the handler isn't being called at all (verified by putting an alert() in it).
I could presumably fix the problem by having each quest's details in a separate HTML file, but that would severely affect usability - with all the details in a single file, you can use your browser's Find feature to search through them all at once. (Also, deploying 1006 files to my web hosting after each update would be a bit of a pain...)
Anybody have an idea for a work-around?

chrome content script to access and modify window

I am writing a Chrome extension that is a 'content script'.
I want to inject a Google map into a webpage.
Problem:
It appears that I have no way to add functions to the window object, so I cannot define a callback function for Google Maps to call when it loads.
How do people usually go about mucking with the window?
--
Someone on the interwebs suggested I do this:
You can do this easily with a JavaScript URL: window.location = "javascript:obj.funcvar = function() {}; void(0);"
But when I did this I got an access denied error. It seems like a lot of search results about this problem are outdated.
Content scripts have a separate JavaScript execution environment from the page they run on, so they cannot alter JS variables in the page itself. However, the content script shares the DOM with the page, so you can inject a <script> tag into the DOM, which will be loaded and run in the actual page's execution environment.
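The injection step described above might look like the following sketch. The names injectPageScript and initMap are hypothetical, and the doc parameter defaults to the page's document; it is only there so the function can be tested outside a browser. A page's Content Security Policy may block inline scripts, in which case setting script.src to a file shipped with the extension is the usual alternative.

```javascript
// Hypothetical helper for a content script: run `code` in the page's own
// JavaScript environment by adding a <script> element to the shared DOM.
function injectPageScript(code, doc = document) {
  const script = doc.createElement('script');
  script.textContent = code;
  // An inline script executes synchronously when it is inserted ...
  (doc.head || doc.documentElement).appendChild(script);
  // ... so the element can be removed again right away.
  script.remove();
  return script;
}

// Usage (assumed callback name): define a global that the Maps loader can call.
// injectPageScript('window.initMap = function () { /* build the map here */ };');
```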

Safari is more forgiving locally than remotely with malformed HTML. Why?

I ran into a curious issue today. We have a web page that hides the body via CSS, and then a bit of JavaScript sets the body to display: block to show it. (This is part of some iframe-busting logic we are required to add.)
We were having issues on one page, but only in Safari. Taking a look, I found that the culprit was an include file that contained its own body tag, so we were ending up with malformed HTML: a body tag nested within the page's existing body tag.
Since the JS was looking for the first body tag the content we actually wanted to show was never shown, since it was wrapped with the second body tag.
I assume Firefox was just forgiving of the HTML and ignored the second body tag. Safari didn't do this when we looked at the page on the server.
However, if I grab the file and run it locally, Safari does tell me:
Extra <body> encountered. Migrating attributes back to the original <body> element and ignoring the tag.
I'm curious as to why Safari might have adopted this 'policy' of ignoring bad HTML locally but not from a server. If it matters, it is an https site we're hitting. Perhaps Safari is being wise and trying to avoid any potential security issues with allowing bad HTML?

Preventing iframe caching in browser

How do you prevent Firefox and Safari from caching iframe content?
I have a simple webpage with an iframe to a page on a different site. Both the outer page and the inner page have HTTP response headers to prevent caching. When I click the "back" button in the browser, the outer page works properly, but no matter what, the browser always retrieves a cache of the iframed page. IE works just fine, but Firefox and Safari are giving me trouble.
My webpage looks something like this:
<html>
<head><!-- stuff --></head>
<body>
<!-- stuff -->
<iframe src="webpage2.html?var=xxx" />
<!-- stuff -->
</body>
</html>
The var variable always changes. Although the URL of the iframe has changed (and thus, the browser should be making a new request to that page), the browser just fetches the cached content.
I've examined the HTTP requests and responses going back and forth, and I noticed that even if the outer page contains <iframe src="webpage2.html?var=222" />, the browser will still fetch webpage2.html?var=111.
Here's what I've tried so far:
Changing iframe URL with random var value
Adding Expires, Cache-Control, and Pragma headers to outer webpage
Adding Expires, Cache-Control, and Pragma headers to inner webpage
I'm unable to do any JavaScript tricks because I'm blocked by the same-origin policy.
I'm running out of ideas. Does anyone know how to stop the browser from caching the iframed content?
Update
I installed Fiddler2 as Daniel suggested to perform another test, and unfortunately, I am still getting the same results.
This is the test I performed:
Outer page generates random number using Math.random() in JSP.
Outer page displays a random number on the webpage.
Outer page calls iframe, passing in a random number.
Inner page displays a random number.
With this test, I'm able to see exactly which pages are updating, and which pages are cached.
Visual Test
For a quick test, I load the page, navigate to another page, and then press "back." Here are the results:
Original Page:
Outer Page: 0.21300034290246206
Inner Page: 0.21300034290246206
Leaving page, then hitting back:
Outer page: 0.4470929019483644
Inner page: 0.21300034290246206
This shows that the inner page is being cached, even though the outer page is calling it with a different GET parameter in the URL. For some reason, the browser is ignoring the fact that the iframe is requesting a new URL; it simply loads the old one.
Fiddler Test
Sure enough, Fiddler confirms the same thing.
(I load the page.)
Outer page is called. HTML:
0.21300034290246206
<iframe src="http://ipv4.fiddler:1416/page1.aspx?var=0.21300034290246206" />
http://ipv4.fiddler:1416/page1.aspx?var=0.21300034290246206 is called.
(I navigate away from the page and then hit back.)
Outer page is called. HTML:
0.4470929019483644
<iframe src="http://ipv4.fiddler:1416/page1.aspx?var=0.4470929019483644" />
http://ipv4.fiddler:1416/page1.aspx?var=0.21300034290246206 is called.
Well, from this test, it looks as though the web browser isn't caching the page, but it's caching the URL of the iframe and then making a new request on that cached URL. However, I'm still stumped as to how to solve this issue.
Does anyone have any ideas on how to stop the web browser from caching iframe URLs?
This is a bug in Firefox:
https://bugzilla.mozilla.org/show_bug.cgi?id=356558
Try this workaround:
<iframe src="webpage2.html?var=xxx" id="theframe"></iframe>
<script>
var _theframe = document.getElementById("theframe");
_theframe.contentWindow.location.href = _theframe.src;
</script>
I have been able to work around this bug by setting a unique name attribute on the iframe - for whatever reason, this seems to bust the cache. You can use whatever dynamic data you have as the name attribute - or simply the current ms or ns time in whatever templating language you're using. This is a nicer solution than those above because it does not directly require JS.
In my particular case, the iframe is being built via JS (but you could do the same via PHP, Ruby, whatever), so I simply use Date.now():
return '<iframe src="' + src + '" name="' + Date.now() + '" />';
This fixes the bug in my testing; probably because the window.name in the inner window changes.
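A slightly more defensive variant of the snippet above, as a sketch: iframeMarkup and escapeAttr are hypothetical helpers that escape the attribute values and emit an explicit closing tag, since iframe is not a void element.

```javascript
// Escape a string for use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Build iframe markup; `name` defaults to the current time in milliseconds.
function iframeMarkup(src, name = Date.now()) {
  return '<iframe src="' + escapeAttr(src) + '" name="' + escapeAttr(name) + '"></iframe>';
}
```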
As you said, the issue here is not iframe content caching, but iframe url caching.
As of September 2018, it seems the issue still occurs in Chrome but not in Firefox.
I've tried many things (adding a changing GET parameter, clearing the iframe url in onbeforeunload, detecting a "reload from cache" using a cookie, setting up various response headers) and here are the only two solutions that worked for me:
1- Easy way: create your iframe dynamically from javascript
For example:
const iframe = document.createElement('iframe')
iframe.id = ...
...
iframe.src = myIFrameUrl
document.body.appendChild(iframe)
2- Convoluted way
Server-side, as explained here, disable content caching for the content you serve for the iframe OR for the parent page (either will do).
AND
Set the iframe url from javascript with an additional changing search param, like this:
const url = myIFrameUrl + '?timestamp=' + new Date().getTime()
document.getElementById('my-iframe-id').src = url
(simplified version, beware of other search params)
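The caveat about other search params can be handled with the standard URL API, which adds or replaces a single parameter without disturbing the rest of the query string or the hash. A minimal sketch; withCacheBuster is a hypothetical name, and the now parameter exists only to make the result predictable in tests.

```javascript
// Hypothetical helper: return `url` with a changing `timestamp` search param,
// keeping any existing params and the hash intact.
function withCacheBuster(url, base, now = Date.now()) {
  const u = new URL(url, base);   // `base` is needed only when `url` is relative
  u.searchParams.set('timestamp', String(now));
  return u.href;
}

// Usage in a page (element id and variable taken from the snippet above):
// const frame = document.getElementById('my-iframe-id');
// frame.src = withCacheBuster(myIFrameUrl, location.href);
```

Using set rather than append means repeated calls replace the previous timestamp instead of piling up duplicates.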
After trying everything else (except using a proxy for the iframe content), I found a way to prevent iframe content caching, from the same domain:
Use .htaccess and a rewrite rule and change the iframe src attribute.
RewriteRule test/([0-9]+)/([a-zA-Z0-9]+).html$ /test/index.php?idEntity=$1&token=$2 [QSA]
The way I use this is that the iframe's URL ends up looking like this: example.com/test/54/e3116491e90e05700880bf8b269a8cc7.html
Here the last path segment is the token, a randomly generated value. This URL prevents iframe caching since the token is never the same, and the iframe thinks it's a totally different webpage, since a single refresh loads a totally different URL:
example.com/test/54/e3116491e90e05700880bf8b269a8cc7.html
example.com/test/54/d2cc21be7cdcb5a1f989272706de1913.html
both lead to the same page.
You can access your hidden url parameters with $_SERVER["QUERY_STRING"]
To get the iframe to always load fresh content, add the current Unix timestamp to the end of the GET parameters. The browser then sees it as a 'different' request and will seek new content.
In Javascript, it might look like:
var timestamp = Math.floor(Date.now() / 1000); // current Unix timestamp, in seconds
frames['my_iframe'].location.href = 'load_iframe_content.php?group_ID=' + group_ID + '&timestamp=' + timestamp;
I found this problem in the latest Chrome as well as the latest Safari on the Mac OS X as of Mar 17, 2016. None of the fixes above worked for me, including assigning src to empty and then back to some site, or adding in some randomly-named "name" parameter, or adding in a random number on the end of the URL after the hash, or assigning the content window href to the src after assigning the src.
In my case, it was because I was using Javascript to update the IFRAME, and only switching the hash in the URL.
The workaround in my case was that I created an interim URL that had a 0 second meta redirect to that other page. It happens so fast that I hardly notice the screen flash. Plus, I made the background color of the interim page the same as the other page, and so you notice it even less.
It is a bug in Firefox 3.5.
Have a look..
https://bugzilla.mozilla.org/show_bug.cgi?id=279048
I set the iframe's src attribute later in my app. To get rid of the cached content inside the iframe at the start of the application, I simply do:
myIframe.src = "";
... somewhere in the beginning of the JS code (for instance in the jQuery $() handler)
Thanks to
http://www.freshsupercool.com/2008/07/10/firefox-caching-iframe-data/
I also had this problem in 2016 with iOS Safari. What seemed to work for me was giving a GET parameter to the iframe src and a value for it, like this:
<iframe width="60%" src="../other/url?cachebust=1" allowfullscreen></iframe>
I also ran into this issue. After trying different browsers and a ton of trial and error, I came up with this solution, which works well in my case:
import { defineComponent } from 'vue'
import { v4 as uuid } from 'uuid'
export default defineComponent({
  setup() {
    return () => (
      // append a uuid after `?` to prevent browsers from caching it
      <iframe src={`https://www.example.com?${uuid()}`} frameborder='0' />
    )
  },
})
If you want to get really crazy you could implement the page name as a dynamic url that always resolves to the same page, rather than the querystring option?
Assuming you're in an office, check whether there's any caching going on at a network level. Believe me, it's a possibility. Your IT folks will be able to tell you if there's any network infrastructure around HTTP caching, although since this only happens for the iframe it's unlikely.
Have you installed Fiddler2?
It will let you see exactly what is being requested, what is being sent back, etc. It doesn't sound plausible that the browser would really hit its cache for different URLs.
Make the URL of the iframe point to a page on your site which acts as a proxy to retrieve and return the actual contents of the iframe. Now you are no longer bound by the same-origin policy (EDIT: does not prevent the iframe caching issue).
