I'm trying to create a clone of getpocket.com for learning. On that app, every saved link gets converted into a markdown; and it seems like the it's a filtered content with only the page title and body without headers, footers, etc.
I could get the page's title using puppeteer api thru different means:
using page.title()
or get the page's opengraph "og:title"
But how do i get like the summarized version containing only the main content of the page.
Note that i don't know beforehand the "css class" of the main content since i'm planning on just entering a url in a textbox and scrape that site from there.
I have found what i've needed for this scenario.
I used the Readability.js library for making webpages readable by removing some certain html tags. Here's the library.
This library is what mozilla uses behind the scenes when rendering their reader view
My use case is: a user goes onto a webpage and modifies it by either filling in a form, populating the page with data from the database, or dragging around some draggables on the page. He can then download the page he modified as pdf. I was thinking of using PhantomJS to do the conversion from html to pdf.
I understand the basic functionality of PhantomJS and got the basic example working but in all the examples I've seen, either a local file or a url is passed in. Example:
page.open('./test.html', function () { ... }
How would I render the page that is getting modified by a user using PhantomJS? I have 2 ideas:
Have the url change as the user modifies the page, and simply pass in the url. For example, the url contains the position of a draggable div.
Send the modified html to back-end, save it, and run PhantomJS
Do these solutions make sense? I'm hoping there would be a simpler way.
I'm writing an extension that scrapes web pages using jquery. After a while I start getting net errors saying resources not available and errors in the console loading images in the pages I'm scraping. I thought it might be $.get() loading it as html somehow, but it still happens when I use a raw XMLHttpRequest and it appears even when I call $(text) with static text.
Looking in the application tab of my background page I can see that there are images, even though they don't exist in the html. For example run this in the console of any extension background page:
$('<div>Hello, world!<img src="https://www.gravatar.com/avatar/fdc806d0a8834e57b2d9309849dea8cd"/></div>')
And you can see the image was loaded on the Application tab in dev tools, though it isn't in the html of the page when inspected and but it's visible on the network tab:
I assume that jquery is creating dom elements to use the browser's capabilities for finding elements, and that chrome is happily pre-fetching that image even though the element isn't on the page and the page will never be visible anyway, but it is causing me errors besides the extra network traffic.
I've tried disabling 'precache' in chrome://flags but that didn't work. For now I'm replacing <img with <noimg which seems to work but is not ideal:
$(text.replace(/<img /g, '<noimg '))
Is there a way to keep this from happening? Is there another library besides jQuery (like cheerio in node) that wouldn't actually create dom objects?
Use the built-in DOMParser to parse the HTML into a detached document, then use jQuery on that document object:
var doc = new DOMParser().parseFromString(yourHTMLstring, 'text/html');
$('.some.selector', doc).attr('foo', 'bar');
In case there may be relative links in the HTML, add a base element explicitly:
$(doc.head).append('<base href="' + realFullURL + '">')
Recently, i integrate node and phantomjs by phantomjs-node. I opened page that has iframe element, i can get the hyperlink element of iframe, but failed when i execute click on it.
Do you have a way? Anyone can help me?
example:
page.open(url);
...
page.evaluate(function(res){
var childDoc = $(window.frames["iframe"].document),
submit = childDoc.find("[id='btnSave']"),
cf = submit.text();//succeed return text
submit.click()//failed
return cf;
},function(res){
console.log("result="+res);//result=submit
spage.render("test.png");//no submit the form
ph.exit();
});
You can't execute stuff in an iframe. You can only read from it. You even created a new document from the iframe, which will only contain the textual representation of the iframe, but it is in no way linked to the original iframe.
You would need to use page.switchToFrame to switch to the frame to execute stuff on the frame without copying it first.
It looks like switchToFrame is not implemented in phantomjs-node. You could try node-phantom.
If the iframe is on the same domain you can try the following from here:
submit = $("iframe").contents().find("[id='btnSave']")
cf = submit.text();
submit.click()
If the iframe is not from the same domain, you will need to create the page with web security turned off:
phantom.create('--web-security=false', function(page){...});
How do you prevent Firefox and Safari from caching iframe content?
I have a simple webpage with an iframe to a page on a different site. Both the outer page and the inner page have HTTP response headers to prevent caching. When I click the "back" button in the browser, the outer page works properly, but no matter what, the browser always retrieves a cache of the iframed page. IE works just fine, but Firefox and Safari are giving me trouble.
My webpage looks something like this:
<html>
<head><!-- stuff --></head>
<body>
<!-- stuff -->
<iframe src="webpage2.html?var=xxx" />
<!-- stuff -->
</body>
</html>
The var variable always changes. Although the URL of the iframe has changed (and thus, the browser should be making a new request to that page), the browser just fetches the cached content.
I've examined the HTTP requests and responses going back and forth, and I noticed that even if the outer page contains <iframe src="webpage2.html?var=222" />, the browser will still fetch webpage2.html?var=111.
Here's what I've tried so far:
Changing iframe URL with random var value
Adding Expires, Cache-Control, and Pragma headers to outer webpage
Adding Expires, Cache-Control, and Pragma headers to inner webpage
I'm unable to do any JavaScript tricks because I'm blocked by the same-origin policy.
I'm running out of ideas. Does anyone know how to stop the browser from caching the iframed content?
Update
I installed Fiddler2 as Daniel suggested to perform another test, and unfortunately, I am still getting the same results.
This is the test I performed:
Outer page generates random number using Math.random() in JSP.
Outer page displays a random number on the webpage.
Outer page calls iframe, passing in a random number.
Inner page displays a random number.
With this test, I'm able to see exactly which pages are updating, and which pages are cached.
Visual Test
For a quick test, I load the page, navigate to another page, and then press "back." Here are the results:
Original Page:
Outer Page: 0.21300034290246206
Inner Page: 0.21300034290246206
Leaving page, then hitting back:
Outer page: 0.4470929019483644
Inner page: 0.21300034290246206
This shows that the inner page is being cached, even though the outer page is calling it with a different GET parameter in the URL. For some reason, the browser is ignoring the fact that the iframe is requesting a new URL; it simply loads the old one.
Fiddler Test
Sure enough, Fiddler confirms the same thing.
(I load the page.)
Outer page is called. HTML:
0.21300034290246206
<iframe src="http://ipv4.fiddler:1416/page1.aspx?var=0.21300034290246206" />
http://ipv4.fiddler:1416/page1.aspx?var=0.21300034290246206 is called.
(I navigate away from the page and then hit back.)
Outer page is called. HTML:
0.4470929019483644
<iframe src="http://ipv4.fiddler:1416/page1.aspx?var=0.4470929019483644" />
http://ipv4.fiddler:1416/page1.aspx?var=0.21300034290246206 is called.
Well, from this test, it looks as though the web browser isn't caching the page, but it's caching the URL of the iframe and then making a new request on that cached URL. However, I'm still stumped as to how to solve this issue.
Does anyone have any ideas on how to stop the web browser from caching iframe URLs?
This is a bug in Firefox:
https://bugzilla.mozilla.org/show_bug.cgi?id=356558
Try this workaround:
<iframe src="webpage2.html?var=xxx" id="theframe"></iframe>
<script>
var _theframe = document.getElementById("theframe");
_theframe.contentWindow.location.href = _theframe.src;
</script>
I have been able to work around this bug by setting a unique name attribute on the iframe - for whatever reason, this seems to bust the cache. You can use whatever dynamic data you have as the name attribute - or simply the current ms or ns time in whatever templating language you're using. This is a nicer solution than those above because it does not directly require JS.
In my particular case, the iframe is being built via JS (but you could do the same via PHP, Ruby, whatever), so I simply use Date.now():
return '<iframe src="' + src + '" name="' + Date.now() + '" />';
This fixes the bug in my testing; probably because the window.name in the inner window changes.
As you said, the issue here is not iframe content caching, but iframe url caching.
As of September 2018, it seems the issue still occurs in Chrome but not in Firefox.
I've tried many things (adding a changing GET parameter, clearing the iframe url in onbeforeunload, detecting a "reload from cache" using a cookie, setting up various response headers) and here are the only two solutions that worked from me:
1- Easy way: create your iframe dynamically from javascript
For example:
const iframe = document.createElement('iframe')
iframe.id = ...
...
iframe.src = myIFrameUrl
document.body.appendChild(iframe)
2- Convoluted way
Server-side, as explained here, disable content caching for the content you serve for the iframe OR for the parent page (either will do).
AND
Set the iframe url from javascript with an additional changing search param, like this:
const url = myIFrameUrl + '?timestamp=' + new Date().getTime()
document.getElementById('my-iframe-id').src = url
(simplified version, beware of other search params)
After trying everything else (except using a proxy for the iframe content), I found a way to prevent iframe content caching, from the same domain:
Use .htaccess and a rewrite rule and change the iframe src attribute.
RewriteRule test/([0-9]+)/([a-zA-Z0-9]+).html$ /test/index.php?idEntity=$1&token=$2 [QSA]
The way I use this is that the iframe's URL end up looking this way: example.com/test/54/e3116491e90e05700880bf8b269a8cc7.html
Where [token] is a randomly generated value. This URL prevents iframe caching since the token is never the same, and the iframe thinks it's a totally different webpage since a single refresh loads a totally different URL :
example.com/test/54/e3116491e90e05700880bf8b269a8cc7.html
example.com/test/54/d2cc21be7cdcb5a1f989272706de1913.html
both lead to the same page.
You can access your hidden url parameters with $_SERVER["QUERY_STRING"]
To get the iframe to always load fresh content, add the current Unix timestamp to the end of the GET parameters. The browser then sees it as a 'different' request and will seek new content.
In Javascript, it might look like:
frames['my_iframe'].location.href='load_iframe_content.php?group_ID=' + group_ID + '×tamp=' + timestamp;
I found this problem in the latest Chrome as well as the latest Safari on the Mac OS X as of Mar 17, 2016. None of the fixes above worked for me, including assigning src to empty and then back to some site, or adding in some randomly-named "name" parameter, or adding in a random number on the end of the URL after the hash, or assigning the content window href to the src after assigning the src.
In my case, it was because I was using Javascript to update the IFRAME, and only switching the hash in the URL.
The workaround in my case was that I created an interim URL that had a 0 second meta redirect to that other page. It happens so fast that I hardly notice the screen flash. Plus, I made the background color of the interim page the same as the other page, and so you notice it even less.
It is a bug in Firefox 3.5.
Have a look..
https://bugzilla.mozilla.org/show_bug.cgi?id=279048
I set iframe src attribute later in my app. To get rid of the cached content inside iframe at the start of the application I simply do:
myIframe.src = "";
... somewhere in the beginning of js code (for instance in jquery $() handler)
Thanks to
http://www.freshsupercool.com/2008/07/10/firefox-caching-iframe-data/
I also had this problem in 2016 with iOS Safari. What seemed to work for me was
giving a GET-parameter to the iframe src and a value for it like this
<iframe width="60%" src="../other/url?cachebust=1" allowfullscreen></iframe>
I also met this issue, after trying different browsers, and a ton of trial and error, I came up with this solution, which works well in my case:
import { defineComponent } from 'vue'
import { v4 as uuid } from 'uuid'
export default defineComponent({
setup() {
return () => (
// append a uuid after `?` to prevent browsers from caching it
<iframe src={`https://www.example.com?${uuid()}`} frameborder='0' />
)
},
})
If you want to get really crazy you could implement the page name as a dynamic url that always resolves to the same page, rather than the querystring option?
Assuming you're in an office, check whether there's any caching going on at a network level. Believe me, it's a possibility. Your IT folks will be able to tell you if there's any network infrastructure around HTTP caching, although since this only happens for the iframe it's unlikely.
Have you installed Fiddler2?
It will let you see exactly what is being requested, what is being sent back, etc. It doesn't sound plausible that the browser would really hit its cache for different URLs.
Make the URL of the iframe point to a page on your site which acts as a proxy to retrieve and return the actual contents of the iframe. Now you are no longer bound by the same-origin policy (EDIT: does not prevent the iframe caching issue).