Scrape a webpage and navigate by clicking buttons - node.js

I want to perform the following actions on the server side:
1) Scrape a webpage
2) Simulate a click on that page and then navigate to the new page
3) Scrape the new page
4) Simulate some button clicks on the new page
5) Send the data back to the client via JSON or something similar
I am thinking of using Node.js for this, but I am confused about which module I should use:
a) Zombie
b) Node.io
c) PhantomJS
d) JSDOM
e) Anything else
I have installed node.io but am not able to run it via the command prompt.
PS: I am working on Windows Server 2008.

Zombie.js and Node.io run on JSDOM, so your options are either JSDOM (or any equivalent wrapper), a headless browser (PhantomJS, SlimerJS), or Cheerio.
JSDOM is fairly slow because it has to recreate the DOM and CSSOM in Node.js.
PhantomJS/SlimerJS are proper headless browsers, so performance is fine and they are also very reliable.
Cheerio is a lightweight alternative to JSDOM. It doesn't recreate the entire page in Node.js (it just downloads and parses the DOM - no JavaScript is executed). Therefore you can't really click on buttons/links, but it's very fast for scraping webpages.
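For the static-scraping case, a minimal Cheerio sketch could look like the following (axios is assumed here as the HTTP client; it is not part of the original answer and any other client would work just as well):
// Sketch only: download a page and extract text with Cheerio (no JavaScript executed)
const axios = require('axios');     // assumed HTTP client
const cheerio = require('cheerio');

axios.get('https://www.domain.com/page1')
    .then(response => {
        const $ = cheerio.load(response.data); // parse the downloaded HTML
        console.log($('h1#foobar').text());    // extract text via a CSS selector
    })
    .catch(err => console.error(err));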
Given your requirements, I'd probably go with a headless browser. In particular, I'd choose CasperJS because it has a nice and expressive API, it's fast and reliable (it doesn't need to reinvent the wheel on how to parse and render the DOM or CSS like JSDOM does), and it's very easy to interact with elements such as buttons and links.
Your workflow in CasperJS should look more or less like this:
casper.start();
casper
    .then(function(){
        console.log("Start:");
    })
    .thenOpen("https://www.domain.com/page1")
    .then(function(){
        // scrape something
        this.echo(this.getHTML('h1#foobar'));
    })
    .thenClick("#button1")
    .then(function(){
        // scrape something else
        this.echo(this.getHTML('h2#foobar'));
    })
    .thenClick("#button2")
    .thenOpen("http://myserver.com", {
        method: "post",
        data: {
            my: 'data'
        }
    }, function() {
        this.echo("data sent back to the server");
    });
casper.run();

Short answer (in 2019): Use puppeteer
If you need a full (headless) browser, use puppeteer instead of PhantomJS, as it offers an up-to-date Chromium browser with a rich API to automate any browser crawling and scraping tasks. If you only want to parse an HTML document (without executing the JavaScript inside the page), you should check out jsdom and cheerio.
Explanation
Tools like jsdom (or cheerio) allow you to extract information from an HTML document by parsing it. This is fast and works well as long as the website does not rely on JavaScript. It will be very hard or even impossible to extract information from a website that is built with JavaScript. jsdom, for example, is able to execute scripts, but it runs them inside a sandbox in your Node.js environment, which can be very dangerous and possibly crash your application. To quote the docs:
However, this is also highly dangerous when dealing with untrusted content.
Therefore, to reliably crawl more complex websites, you need an actual browser. For years, the most popular solution for this task was PhantomJS. But in 2018, the development of PhantomJS was officially suspended. Thankfully, since April 2017 the Google Chrome team has made it possible to run the Chrome browser headlessly (announcement).
This makes it possible to crawl websites using an up-to-date browser with full JavaScript support.
To control the browser, the library puppeteer, which is also maintained by Google developers, offers a rich API for use within the Node.js environment.
Code sample
The lines below show a simple example. It uses Promises and the async/await syntax to execute a number of tasks. First, the browser is started (puppeteer.launch) and a URL is opened (page.goto).
After that, functions like page.evaluate and page.click are used to extract information and execute actions on the page. Finally, the browser is closed (browser.close).
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');

    // example: get innerHTML of an element
    const someContent = await page.$eval('#selector', el => el.innerHTML);

    // Use Promise.all to wait for two actions (navigation and click)
    await Promise.all([
        page.waitForNavigation(), // wait for the navigation to happen
        page.click('a.some-link'), // click the link that causes the navigation
    ]);

    // another example, this time using the evaluate function to return the innerText of the body
    const moreContent = await page.evaluate(() => document.body.innerText);

    // click another button
    await page.click('#button');

    // close the browser when we are done
    await browser.close();
})();
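For the simpler case mentioned in the short answer (parsing an HTML document without executing the page's JavaScript), a minimal jsdom sketch could look like this; the URL and selector are placeholders:
// Sketch only: parse a page with jsdom (scripts are not executed by default)
const { JSDOM } = require('jsdom');

JSDOM.fromURL('https://example.com').then(dom => {
    const heading = dom.window.document.querySelector('h1'); // placeholder selector
    console.log(heading && heading.textContent);
});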

The modules you listed do the following:
PhantomJS/Zombie - simulate a browser (headless - nothing is actually displayed). They can be used for scraping static or dynamic pages, or for testing your HTML pages.
Node.io/jsdom - web scraping: extracting data from a page (static).
Looking at your requirements, you could use PhantomJS or Zombie.

Related

How to Clear History (Clear browsing data) In Node.js Puppeteer Headless=false Chromium browser

I am trying to delete the history in a headless=false browser with Node.js puppeteer using the code below, but none of the methods work.
await page.goto('chrome://settings/clearBrowserData');
await page.keyboard.down('Enter');
Second attempt:
await page.keyboard.down('ControlLeft');
await page.keyboard.down('ShiftLeft');
await page.keyboard.down('Delete');
await page.keyboard.down('Enter');
I tried to use the .evaluateHandle() and .click() functions too, but none of them work. If anyone knows how to clear the history with puppeteer, please answer.
It is not possible to navigate to the browser settings pages (chrome://...) like that.
You have three options:
Use an incognito window (called context in puppeteer)
Use a command from the Chrome DevTools Protocol to clear history.
Restart the browser
Option 1: Use an incognito window
To clear the history (including cookies and any other data), you can use an "incognito" window, called a BrowserContext in puppeteer.
You create a context by calling browser.createIncognitoBrowserContext(). Quote from the docs:
Creates a new incognito browser context. This won't share cookies/cache with other browser contexts.
Example
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();
// Execute your code
await page.goto('...');
// ...
await context.close(); // clear history
This example will create a new incognito browser window and open a page inside. From there on you can use the page handle like you normally would.
To clear any cookies or history inside, simply close the context via context.close().
Option 2: Use the Chrome DevTools Protocol to clear history
If you cannot rely on using contexts (as they are not supported when using extensions), you can use the Chrome DevTools Protocol to clear the browser's history. It offers functions to reset cookies and cache which are not exposed in puppeteer. You can call Chrome DevTools Protocol functions directly by using a CDPSession.
Example
const client = await page.target().createCDPSession();
await client.send('Network.clearBrowserCookies');
await client.send('Network.clearBrowserCache');
This will instruct the browser to clear the cookies and cache by directly calling Network.clearBrowserCookies and Network.clearBrowserCache.
Option 3: Restart the browser
If neither approach is feasible, you can always restart the browser by closing the old instance and creating a new one. This will clear any stored data.
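A minimal sketch of option 3, assuming browser and page are reassignable variables in your script:
// Sketch only: drop the old instance and start a fresh one (all stored data is gone)
await browser.close();
browser = await puppeteer.launch();
page = await browser.newPage();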

NodeJS with Express as a server for HTML --> PDF generation. Can it be efficient?

I know about the event loop and the single-threaded nature of Node.js. Given that, do you think it is a good idea to go ahead and develop a Node.js/Express service that we can use to convert HTML parts to PDF pages?
We are thinking about Puppeteer. I have already used it and it works great, but I'm not sure whether each user in the organization would have to wait on the event loop, because each request would keep the process busy until the end.
Event Loop
The event loop is what takes care of the "single-threaded event-driven" nature of JavaScript, meaning that asynchronous (JavaScript) code that needs to be executed will be put in a queue and executed one after another (by the loop) instead of using a more classic multi-threading approach. For more information on this topic I recommend this great video explanation.
The event loop is not really related to your problem, as most of the work happens asynchronously inside the browser (and not inside the Node.js runtime). That means your puppeteer script will spend most of its time waiting for the browser to return results.
Consider a simple line like this:
await browser.newPage();
What does this actually do? It sends a command to the browser (running in another process) to open a page. The actual work happens inside the browser, not in your Node.js environment. The same goes for basically all puppeteer functions. So the "main work" does not happen inside your Node.js environment, and the event loop is not related to your problem.
Implementation
What you are describing is absolutely doable with puppeteer and Node.js. Let's consider this example code which should get you started:
const puppeteer = require('puppeteer');
const express = require('express');

const app = express();

// Call /pdf?url=... to create a PDF of the provided URL
app.get('/pdf', async (req, res) => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(req.query.url); // URL is given by the user

    const pdfBuffer = await page.pdf();

    // Respond with the PDF
    res.writeHead(200, {
        'Content-Type': 'application/pdf',
        'Content-Length': pdfBuffer.length
    });
    res.end(pdfBuffer);

    await browser.close();
});

app.listen(4000);
This will offer an API to generate a PDF of a URL. Every request will open a browser, open a new page, navigate to the given URL and return a PDF to the user. Thanks to the asynchronous environment of JavaScript, this will happen fully in parallel. As long as your machine can handle the number of parallel open browsers, you are fine.
Further improvement
While the given script works, you should keep in mind that too many requests might quickly consume too much memory/CPU due to the many open browsers and therefore lead to resource problems. To improve the implementation, you want to use a pool of puppeteer resources to handle the traffic. For that, you might want to look into puppeteer-cluster (disclaimer: I'm the author), which provides a pool of browser instances and allows you to limit the number of running browsers. The library can handle this use case easily. There is actually an example online for this exact use case (however, it generates screenshots instead of PDFs).
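To illustrate the idea, here is a rough sketch of how the Express handler above could be backed by puppeteer-cluster (the concurrency mode and pool size are assumptions; check the library's documentation for the details):
// Sketch only: reuse a limited pool of browsers instead of launching one per request
const { Cluster } = require('puppeteer-cluster');
const express = require('express');

const app = express();

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER, // one browser per worker (assumption)
        maxConcurrency: 4,                        // at most 4 browsers in parallel (tune this)
    });

    // The task run by the pool: open the URL and render it as a PDF
    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        return await page.pdf();
    });

    app.get('/pdf', async (req, res) => {
        const pdfBuffer = await cluster.execute(req.query.url); // queue the job and wait for the result
        res.writeHead(200, {
            'Content-Type': 'application/pdf',
            'Content-Length': pdfBuffer.length
        });
        res.end(pdfBuffer);
    });

    app.listen(4000);
})();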

Read content from rendered webpage into nodejs

I would like to read the entire content of a fully rendered webpage into Node.js and do some stuff with the content.
At the moment I am using PhantomJS, but it is very unstable. It crashes every 10-20 pages and it leaks memory like crazy (from 300 MB to 2.8 GB after just 15 pages).
It's the same on our Ubuntu server - it runs for 10-20 pages and then crashes.
I can see that a lot of other people out there have the exact same problem with PhantomJS.
So I wondered... what are the alternatives?
Does anyone here know how to fix PhantomJS, or know of another simple, stable component which can read a rendered webpage and put it into a variable in Node.js?
Any help will be MUCH appreciated - I have wasted over 100 hours trying to get PhantomJS to work (a new instance for each page, re-using the same instance, slowing it down with timeouts, etc... no matter what, it still leaks and still crashes).
In the past, when scraping heavy sites, I achieved good results by cancelling some of the requests made to third-party sites such as Google Maps, Facebook and Twitter widgets, ad distributors and the like; see here for more detail.
But nowadays I just suggest puppeteer. It's a native Node module, it uses the latest Chromium as the browser, and it is being continuously developed by Google engineers. Its API ideology is based on that of PhantomJS. Usage in Node 8+ with async/await provides the most satisfying scraping experience.
Puppeteer is a bit heavier on the hardware though.
Consider the example for getting the page contents:
const puppeteer = require('puppeteer');

puppeteer.launch().then(async browser => {
    const page = await browser.newPage();
    await page.goto('https://angular.io/', { waitUntil: 'networkidle2' });
    const contents = await page.content();
    console.log(contents);
    await browser.close();
});
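If you also want to cancel requests to third-party sites, as mentioned at the start of this answer, puppeteer supports request interception. A minimal sketch, to be placed right after browser.newPage() inside the async callback (the blocked domains below are just examples):
// Sketch only: abort requests to example third-party domains, let everything else through
await page.setRequestInterception(true);
page.on('request', request => {
    const blocked = ['googletagmanager.com', 'facebook.net', 'ads.example.com']; // example list
    if (blocked.some(domain => request.url().includes(domain))) {
        request.abort();     // cancel the request
    } else {
        request.continue();  // let the request proceed
    }
});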

How to run nightmare app in headless mode?

I have a web scraper which uses the Nightmare browser automation library. Every time I run my Node.js app, it opens a browser window and loads the page I am trying to scrape. But I want to run it completely in the console, without any windows popping up.
The closest you are going to get is keeping the window hidden. To do this, instantiate your Nightmare like this:
var nm = new Nightmare({show: false});
I would say "don't". Use Cheerio instead; it's built for headless HTML scraping.

want to write node.js http client for web site testing

I am new to Node.js.
I want to try to write a Node.js client for testing my web site
(stuff like logging in, filling forms, etc.).
Which module should I use for that?
Since I want to test user login followed by other user functionality,
it should be able to keep a session like a browser does.
Also, is there a site with examples of using that module?
Thanks
As Amenadiel said in the comments, you might want to use something like PhantomJS for testing websites.
But if you're new to Node.js, maybe start with something light, like Zombie.js.
An example from their home page:
var Browser = require("zombie");
var assert = require("assert");

// Load the page from localhost
browser = new Browser();
browser.visit("http://localhost:3000/", function () {
    // Fill email, password and submit the form
    browser.
        fill("email", "zombie@underworld.dead").
        fill("password", "eat-the-living").
        pressButton("Sign Me Up!", function() {
            // Form submitted, new page loaded.
            assert.ok(browser.success);
            assert.equal(browser.text("title"), "Welcome To Brains Depot");
        });
});
Later on, when you get the hang of it, maybe switch to Phantom (which has WebKit underneath, so it's not emulating the DOM).
