I would like to read the entire content of a fully rendered webpage into nodejs and do some stuff with the content.
At the moment I am using PhantomJS, but it is so unstable. It crashes every 10-20 pages and it leaks memory like crazy (from 300 MB to 2.8 GB after just 15 pages).
It's the same on our Ubuntu server - it runs for 10-20 pages and then crashes.
I can see a lot of other people out there have the exact same problem with PhantomJS.
So I wondered... what are the alternatives?
Does anyone here know how to fix PhantomJS, or know of another simple, stable component which can read a rendered webpage and put it into a variable in Node.js?
Any help will be MUCH appreciated - I've wasted over 100 hours trying to get PhantomJS to work (a new instance for each page, re-using the same instance, slowing it down with timeouts, etc.... no matter what, it still leaks and still crashes).
In the past, when scraping heavy sites, I achieved good results by cancelling some of the requests made to third-party sites (Google Maps, Facebook and Twitter widgets, ad distributors and such); see here in more detail. A puppeteer sketch of that request-blocking idea follows the example below.
But nowadays I just suggest puppeteer. It's a native Node module, it uses the latest Chromium as a browser, and it is being continuously developed by Google engineers. Its API ideology is based on that of PhantomJS. Using it in Node 8+ with async/await provides the most satisfying scraping experience.
Puppeteer is a bit heavier on the hardware though.
Consider the example for getting the page contents:
const puppeteer = require('puppeteer');

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  // networkidle2: navigation is considered finished once there are no more than 2 network connections for at least 500 ms
  await page.goto('https://angular.io/', {waitUntil: 'networkidle2'});
  const contents = await page.content(); // full, rendered HTML of the page
  console.log(contents);
  await browser.close();
});
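For reference, the request-blocking idea mentioned earlier can be sketched with puppeteer's request interception. This is only a sketch; the blocked hostnames and the target URL are placeholders:

const puppeteer = require('puppeteer');

// Example hostnames only - adjust to whatever third-party widgets the target site loads
const blocked = ['googletagmanager.com', 'google-analytics.com', 'facebook.net'];

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', request => {
    // Abort requests to third-party widgets/trackers, let everything else through
    if (blocked.some(host => request.url().includes(host))) {
      request.abort();
    } else {
      request.continue();
    }
  });
  await page.goto('https://example.com', {waitUntil: 'networkidle2'});
  console.log(await page.content());
  await browser.close();
});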
I have a client who has asked me to create thumbnails of downloaded images. I have a snippet of code which uses node-fetch to download the image into a buffer:
const fetch = require('node-fetch');
const URL =
'https://lf.lids.com/hwl?set=sku[20905595],c[2],w[400],h[300]&call=url[file:product]';
async function main() {
  const t = await fetch(URL);
  const tt = await t.buffer();
  debugger;
}
main();
This works for most images except the one in the code. I have a feeling lids.com may either be doing some redirect magic or preventing scraping, but I'm not able to debug this.
I've also tried setting an assortment of headers to mimic the browser (which loads the image), but nothing has worked so far. I'm not sure if this is a library issue or an operational issue.
Turns out there are two issues:
The User-Agent needs to be changed to reflect the browser (see the sketch after this list)
Some servers just straight up block anything from AWS to prevent scraping; we ended up using Crawlera to get around this
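A minimal sketch of the first fix, assuming node-fetch as in the snippet above; the User-Agent string is just an example of a browser-like value:

const fetch = require('node-fetch');

const URL =
  'https://lf.lids.com/hwl?set=sku[20905595],c[2],w[400],h[300]&call=url[file:product]';

async function main() {
  // Send a browser-like User-Agent instead of node-fetch's default one
  const res = await fetch(URL, {
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36'
    }
  });
  const buffer = await res.buffer();
  console.log(buffer.length);
}

main();

Note that the second issue cannot be solved with headers alone; traffic blocked because it comes from AWS still needs a proxy service such as Crawlera.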
I know about the event loop and the single-threaded nature of Node.js. Given that, do you think it is a good idea to go ahead and develop a Node.js/Express service that we can use to convert HTML parts to PDF pages?
We are thinking about Puppeteer. I have already used it and it works great, but I'm not sure whether every user in the organization would end up waiting on the event loop, because each request would keep the process busy until it finishes.
Event Loop
The event loop is what takes care of the "single-threaded event-driven" nature of JavaScript, meaning that asynchronous (JavaScript) code that needs to be executed will be put in a queue and executed one after another (by the loop) instead of using a more classic multi-threading approach. For more information on this topic I recommend this great video explanation.
The event loop is not really related to your problem, as most of the work happens asynchronously inside the browser (and not inside the Node.js runtime). That means that your puppeteer script will most of the time wait for the browser to return results.
Consider a simple line like this:
await browser.newPage();
What does this actually do? It sends a command to the browser (running in another process) to open a page. The actual work happens inside the browser, not in your Node.js environment. The same goes for basically all puppeteer functions. Because the "main work" does not happen inside your Node.js environment, the event loop is not the bottleneck for your problem.
Implementation
What you are describing is absolutely doable with puppeteer and Node.js. Let's consider this example code which should get you started:
const puppeteer = require('puppeteer');
const express = require('express');
const app = express();

app.get('/pdf', async (req, res) => { // Call /pdf?url=... to create a PDF of the provided URL
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(req.query.url); // URL is given by the user
  const pdfBuffer = await page.pdf();

  // Respond with the PDF
  res.writeHead(200, {
    'Content-Type': 'application/pdf',
    'Content-Length': pdfBuffer.length
  });
  res.end(pdfBuffer);

  await browser.close();
});

app.listen(4000);
This will offer an API to generate a PDF of a URL. Every request will open a browser, open a new page, navigate to the given URL and return a PDF to the user. Thanks to the asynchronous environment of JavaScript, this will happen fully in parallel. As long as your machine can handle the number of parallel open browsers, you are fine.
Further improvement
While the given script works, you should keep in mind that too many requests might quickly consume too much memory/CPU due to the many open browsers and therefore lead to resource problems. To improve the implementation, you want to use a pool of puppeteer resources to handle the traffic. For that you might want to look into puppeteer-cluster (disclaimer: I'm the author), which provides you with a pool of browser instances and allows you to limit the number of running browsers. The library can handle this use case easily. There is actually an example online for this exact use case (however, it generates screenshots instead of PDFs).
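As a rough sketch (not the library's official example), the endpoint above could be restructured with puppeteer-cluster along these lines; the concurrency limit of 4 is an arbitrary value, and cluster.execute is assumed to resolve with the task's return value for each request:

const express = require('express');
const { Cluster } = require('puppeteer-cluster');

(async () => {
  // Pool of browser contexts; at most 4 jobs run at the same time
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 4
  });

  // The task is defined once and reused for every incoming job
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    return page.pdf();
  });

  const app = express();

  app.get('/pdf', async (req, res) => {
    // execute() queues the job and resolves with the value returned by the task
    const pdfBuffer = await cluster.execute(req.query.url);
    res.writeHead(200, {
      'Content-Type': 'application/pdf',
      'Content-Length': pdfBuffer.length
    });
    res.end(pdfBuffer);
  });

  app.listen(4000);
})();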
Now that I've realized that executing a function like
let parser = new DOMParser();
works only when it's executed in a browser, but not on a server like Express/Node, what are the options?
I've read many posts with NodeJS alternatives to DOMParser, but then there're the libraries
https://www.npmjs.com/package/dom-parser
https://www.npmjs.com/package/dom-node-iterator
Are these valid alternatives? Do I install them and adapt the method calls depending on whether the code runs in the browser or in Node?
I'm asking because, looking at posts like this one (HTML-parser on Node.js), it doesn't seem so simple.
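For context, jsdom (which comes up further down in this thread) can stand in for DOMParser on the server side. This is only a minimal sketch:

const { JSDOM } = require('jsdom');

// Parse an HTML string on the server, roughly like DOMParser would in the browser
const dom = new JSDOM('<p id="greeting">Hello</p>');
const document = dom.window.document;
console.log(document.querySelector('#greeting').textContent); // "Hello"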
I have a web scraper which uses the Nightmare browser automation library. Every time I run my Node.js app it opens up a browser window and loads the page I am trying to scrape. But I want to run it completely in the console without any windows popping up.
The closest you are going to get is just keeping the window hidden. To do this, instantiate your Nightmare as such:
var nm = new Nightmare({show: false})
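For completeness, a full hidden-window scrape might then look roughly like this (the URL is a placeholder):

const Nightmare = require('nightmare');

const nightmare = Nightmare({ show: false }); // no visible window

nightmare
  .goto('https://example.com')
  .evaluate(() => document.body.innerHTML) // runs inside the page
  .end()
  .then(html => {
    console.log(html);
  })
  .catch(err => {
    console.error('Scrape failed:', err);
  });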
I would say "don't". Use Cheerio instead; it's built for headless HTML scraping.
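A minimal sketch of the Cheerio approach, assuming node-fetch for downloading the HTML (the URL and selector are placeholders):

const fetch = require('node-fetch');
const cheerio = require('cheerio');

(async () => {
  // Download the raw HTML (no JavaScript is executed)
  const html = await (await fetch('https://example.com')).text();

  // Parse it and query with jQuery-like selectors
  const $ = cheerio.load(html);
  console.log($('h1').text());
})();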
I want to perform following actions at the server side:
1) Scrape a webpage
2) Simulate a click on that page and then navigate to the new page.
3) Scrape the new page
4) Simulate some button clicks on the new page
5) Sending the data back to the client via json or something
I am thinking of using it with Node.js.
But I am confused as to which module I should use:
a) Zombie
b) Node.io
c) Phantomjs
d) JSDOM
e) Anything else
I have installed node.io but am not able to run it via the command prompt.
PS: I am working on Windows Server 2008.
Zombie.js and Node.io run on JSDOM, hence your options are either going with JSDOM (or any equivalent wrapper), a headless browser (PhantomJS, SlimerJS) or Cheerio.
JSDOM is fairly slow because it has to recreate the DOM and CSSOM in Node.js.
PhantomJS/SlimerJS are proper headless browsers, thus performance is OK and they are also very reliable.
Cheerio is a lightweight alternative to JSDOM. It doesn't recreate the entire page in Node.js (it just downloads and parses the DOM - no JavaScript is executed). Therefore you can't really click on buttons/links, but it's very fast for scraping webpages.
Given your requirements, I'd probably go with something like a headless browser. In particular, I'd choose CasperJS because it has a nice and expressive API, it's fast and reliable (it doesn't need to reinvent the wheel of how to parse and render the DOM or CSS like JSDOM does), and it's very easy to interact with elements such as buttons and links.
Your workflow in CasperJS should look more or less like this:
casper.start();

casper
  .then(function(){
    console.log("Start:");
  })
  .thenOpen("https://www.domain.com/page1")
  .then(function(){
    // scrape something
    this.echo(this.getHTML('h1#foobar'));
  })
  .thenClick("#button1")
  .then(function(){
    // scrape something else
    this.echo(this.getHTML('h2#foobar'));
  })
  .thenClick("#button2")
  .thenOpen("http://myserver.com", {
    method: "post",
    data: {
      my: 'data',
    }
  }, function() {
    this.echo("data sent back to the server");
  });

casper.run();
Short answer (in 2019): Use puppeteer
If you need a full (headless) browser, use puppeteer instead of PhantomJS, as it offers an up-to-date Chromium browser with a rich API to automate any browser crawling and scraping tasks. If you only want to parse an HTML document (without executing JavaScript inside the page), you should check out jsdom and cheerio.
Explanation
Tools like jsdom (or cheerio) make it possible to extract information from an HTML document by parsing it. This is fast and works well as long as the website does not contain JavaScript. It will be very hard or even impossible to extract information from a website built on JavaScript. jsdom, for example, is able to execute scripts, but it runs them inside a sandbox in your Node.js environment, which can be very dangerous and possibly crash your application. To quote the docs:
However, this is also highly dangerous when dealing with untrusted content.
Therefore, to reliably crawl more complex websites, you need an actual browser. For years, the most popular solution for this task was PhantomJS. But in 2018, the development of PhantomJS was officially suspended. Thankfully, since April 2017 the Google Chrome team has made it possible to run the Chrome browser headlessly (announcement).
This makes it possible to crawl websites using an up-to-date browser with full JavaScript support.
To control the browser, the library puppeteer, which is also maintained by Google developers, offers a rich API for use within the Node.js environment.
Code sample
The lines below show a simple example. It uses Promises and the async/await syntax to execute a number of tasks. First, the browser is started (puppeteer.launch) and a URL is opened (page.goto).
After that, functions like page.evaluate and page.click are used to extract information and execute actions on the page. Finally, the browser is closed (browser.close).
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // example: get innerHTML of an element
  const someContent = await page.$eval('#selector', el => el.innerHTML);

  // Use Promise.all to wait for two actions (navigation and click)
  await Promise.all([
    page.waitForNavigation(), // wait for navigation to happen
    page.click('a.some-link'), // click link to cause navigation
  ]);

  // another example, this time using the evaluate function to return innerText of body
  const moreContent = await page.evaluate(() => document.body.innerText);

  // click another button
  await page.click('#button');

  // close browser when we are done
  await browser.close();
})();
The modules you listed do the following:
PhantomJS/Zombie - simulate a browser (headless - nothing is actually displayed). Can be used for scraping static or dynamic pages, or for testing your HTML pages.
Node.io/jsdom - web scraping: extracting data from a page (static).
Looking at your requirements, you could use PhantomJS or Zombie.
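As a very rough sketch of the Zombie route for steps 1-3 (the URL and selectors are placeholders, and it assumes Zombie's promise-based API):

const Browser = require('zombie');

const browser = new Browser();

browser.visit('https://www.example.com/page1') // load and render the page
  .then(() => {
    console.log(browser.text('h1'));           // scrape something
    return browser.clickLink('#next-page');    // simulate a click on a link and navigate
  })
  .then(() => {
    console.log(browser.text('h2'));           // scrape the new page
  })
  .catch(err => console.error(err));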