I would like to read the entire content of a fully rendered webpage into nodejs and do some stuff with the content.
At the moment I am using PhantomJS, but it is very unstable. It crashes every 10-20 pages and it leaks memory like crazy (from 300 MB to 2.8 GB after just 15 pages).
It's the same on our Ubuntu server: it runs for 10-20 pages and then crashes.
I can see a lot of other people out there have the exact same problem with PhantomJS.
So I wondered... what are the alternatives?
Does anyone here know how to fix PhantomJS, or know of another simple, stable component which can read a rendered webpage into a variable in Node.js?
Any help will be MUCH appreciated. I have wasted over 100 hours trying to get PhantomJS to work (a new instance for each page, re-using the same instance, turning down the speed using timeouts, etc.); no matter what, it still leaks and still crashes.
In the past, when scraping heavy sites, I achieved good results by cancelling some of the requests made to third-party sites, like Google Maps, Facebook and Twitter widgets, ad distributors and such; see here for more detail.
But nowadays I just suggest Puppeteer. It's a native Node module, it uses the latest Chromium as its browser, and it is being continuously developed by Google engineers. The API ideology is based on that of PhantomJS. Using it in Node 8+ with async/await provides the most satisfying scraping experience.
Puppeteer is a bit heavier on the hardware though.
Consider the example for getting the page contents:
const puppeteer = require('puppeteer');

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.goto('https://angular.io/', { waitUntil: 'networkidle2' });
  const contents = await page.content();
  console.log(contents);
  await browser.close();
});
I am currently debugging a Node.js Express application and I was wondering if there is a way to inspect a console-logged object, similar to how you would do it when developing web applications in Chrome or Firefox.
E.g.:
var myObj = [{"hello": "world"}];
console.log(myObj);
Inspect Object:
Below is an example of a console.log() I am trying to inspect from Express:
Check out node-inspector to debug your Node.js script in a browser.
Here's a link to a video showing how to use the module.
I'm trying to run some javascript code (originally developed for browser) in node.js environment.
I use createDocumentFragment in order to minimize DOM node access.
(Obviously its purpose is to build DOM elements before inserting them into the document body.)
I can run $.append using cheerio as $ in Node.js.
Is there a way to run createDocumentFragment in node.js?
No: Node.js by itself only provides a JavaScript runtime, not a DOM, so createDocumentFragment is not available natively.
I am new to node.js
I want to try to write a Node.js client for testing my web site
(stuff like login, filling forms, etc...).
Which module should I use for that?
Since I want to test user login followed by other user functionality,
it should be able to keep a session like a browser does.
Also, is there any site which has examples of using that module?
Thanks
As Amenadiel has said in the comments, you might want to use something like Phantom.js for testing websites.
But if you're new to Node.js, maybe try something lighter, like Zombie.js.
An example from their home page:
var Browser = require("zombie");
var assert = require("assert");

// Load the page from localhost
var browser = new Browser();
browser.visit("http://localhost:3000/", function () {
  // Fill email, password and submit form
  browser.
    fill("email", "zombie@underworld.dead").
    fill("password", "eat-the-living").
    pressButton("Sign Me Up!", function() {
      // Form submitted, new page loaded.
      assert.ok(browser.success);
      assert.equal(browser.text("title"), "Welcome To Brains Depot");
    });
});
Later on, when you get the hang of it, maybe switch to Phantom (which has WebKit underneath, so it's not emulating the DOM).
I want to perform following actions at the server side:
1) Scrape a webpage
2) Simulate a click on that page and then navigate to the new page.
3) Scrape the new page
4) Simulate some button clicks on the new page
5) Send the data back to the client via JSON or something similar
I am thinking of using Node.js for this,
but I am confused as to which module I should use:
a) Zombie
b) Node.io
c) Phantomjs
d) JSDOM
e) Anything else
I have installed node.io but am not able to run it via the command prompt.
PS: I am working on Windows Server 2008.
Zombie.js and Node.io run on JSDOM, hence your options are either going with JSDOM (or any equivalent wrapper), a headless browser (PhantomJS, SlimerJS) or Cheerio.
JSDOM is fairly slow because it has to recreate DOM and CSSOM in Node.js.
PhantomJS/SlimerJS are proper headless browsers, so performance is good and they are also very reliable.
Cheerio is a lightweight alternative to JSDOM. It doesn't recreate the entire page in Node.js (it just downloads and parses the HTML; no JavaScript is executed). Therefore you can't really click on buttons/links, but it's very fast at scraping webpages.
Given your requirements, I'd probably go with something like a headless browser. In particular, I'd choose CasperJS because it has a nice and expressive API, it's fast and reliable (it doesn't need to reinvent the wheel on how to parse and render the DOM or CSS like JSDOM does), and it's very easy to interact with elements such as buttons and links.
Your workflow in CasperJS should look more or less like this:
casper.start();

casper
  .then(function() {
    console.log("Start:");
  })
  .thenOpen("https://www.domain.com/page1")
  .then(function() {
    // scrape something
    this.echo(this.getHTML('h1#foobar'));
  })
  .thenClick("#button1")
  .then(function() {
    // scrape something else
    this.echo(this.getHTML('h2#foobar'));
  })
  .thenClick("#button2")
  .thenOpen("http://myserver.com", {
    method: "post",
    data: {
      my: 'data'
    }
  }, function() {
    this.echo("data sent back to the server");
  });

casper.run();
Short answer (in 2019): Use puppeteer
If you need a full (headless) browser, use puppeteer instead of PhantomJS, as it offers an up-to-date Chromium browser with a rich API to automate any browser crawling and scraping tasks. If you only want to parse an HTML document (without executing the JavaScript inside the page), you should check out jsdom and cheerio.
Explanation
Tools like jsdom (or cheerio) allow you to extract information from an HTML document by parsing it. This is fast and works well as long as the website does not rely on JavaScript; it will be very hard or even impossible to extract information from a website built on JavaScript. jsdom, for example, is able to execute scripts, but it runs them inside a sandbox in your Node.js environment, which can be very dangerous and may crash your application. To quote the docs:
However, this is also highly dangerous when dealing with untrusted content.
Therefore, to reliably crawl more complex websites, you need an actual browser. For years, the most popular solution for this task was PhantomJS. But in 2018, the development of PhantomJS was officially suspended. Thankfully, since April 2017 the Google Chrome team has made it possible to run the Chrome browser headlessly (announcement).
This makes it possible to crawl websites using an up-to-date browser with full JavaScript support.
To control the browser, the library puppeteer, which is also maintained by Google developers, offers a rich API for use within the Node.js environment.
Code sample
The lines below show a simple example. It uses Promises and the async/await syntax to execute a number of tasks. First, the browser is started (puppeteer.launch) and a URL is opened (page.goto).
After that, functions like page.evaluate and page.click are used to extract information and execute actions on the page. Finally, the browser is closed (browser.close).
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com');

  // example: get innerHTML of an element
  const someContent = await page.$eval('#selector', el => el.innerHTML);

  // Use Promise.all to wait for two actions (navigation and click)
  await Promise.all([
    page.waitForNavigation(), // wait for navigation to happen
    page.click('a.some-link'), // click link to cause navigation
  ]);

  // another example, this time using the evaluate function to return the innerText of the body
  const moreContent = await page.evaluate(() => document.body.innerText);

  // click another button
  await page.click('#button');

  // close browser when we are done
  await browser.close();
})();
The modules you listed do the following:
PhantomJS/Zombie - simulate a browser (headless: nothing is actually displayed). They can be used for scraping static or dynamic pages, or for testing your HTML pages.
Node.io/jsdom - web scraping: extracting data from a page (static).
Looking at your requirements, you could use phantom or zombie.