Nightmare, PhantomJS and extracting page data - node.js

I'm new to Nightmare/PhantomJS and am struggling to get a simple inventory of all the tags on a given page. I'm running on Ubuntu 14.04 after building PhantomJS from source and installing NodeJS, Nightmare and so forth manually, and other functions seem to be working as I expect.
Here's the code I'm using:
var Nightmare = require('nightmare');
new Nightmare()
    .goto("http://www.google.com")
    .wait()
    .evaluate(function ()
    {
        var a = document.getElementsByTagName("*");
        return (a);
    },
    function (i)
    {
        for (var index = 0; index < i.length; index++)
            if (i[index])
                console.log("Element " + index + ": " + i[index].nodeName);
    })
    .run(function (err, nightmare)
    {
        if (err)
            console.log(err);
    });
When I run this inside a "real" browser, I get a list of all the tag types on the page (HTML, HEAD, BODY, ...). When I run this using node GetTags.js, I just get a single line of output:
Element 0: HTML
I'm sure it's a newbie problem, but what am I doing wrong here?

PhantomJS has two contexts. The page context, which provides access to the DOM, can only be reached through evaluate(), so variables must be explicitly passed in and out of it. But there is a limitation (docs):
Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.
Closures, functions, DOM nodes, etc. will not work!
Nightmare's evaluate() function is only a wrapper around the PhantomJS function of the same name. This means that you will need to work with the elements in the page context and only pass a representation to the outside. For example:
.evaluate(function ()
{
    var a = document.getElementsByTagName("div");
    return a.length;
},
function (i)
{
    console.log(i + " divs available");
})
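Applied to the original question, that means serializing what you need (here, the tag names) before it leaves the page context. A minimal sketch using the same callback-style evaluate() as above:
.evaluate(function ()
{
    // Build a plain array of strings inside the page context;
    // strings survive the JSON serialization boundary, DOM nodes do not.
    var names = [];
    var all = document.getElementsByTagName("*");
    for (var i = 0; i < all.length; i++)
        names.push(all[i].nodeName);
    return names;
},
function (names)
{
    // Back in the Node context: names is an ordinary array of strings.
    for (var index = 0; index < names.length; index++)
        console.log("Element " + index + ": " + names[index]);
})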

Related

Async, await, callback - What exactly is the execution context of a callback function?

So, I have previous programming experience in numerous languages: assembly, C, C++, BASIC, page description languages, etc.
I am currently learning Node.js and Puppeteer and have run into something I cannot quite make sense of.
I have read various things that seem to explain various limitations of the callback execution context, but I have not found anything that explains this specifically.
I am attempting to call functions or reference variables (defined in the current module) from within a callback function. I have tried a number of variations, with variables of assorted types defined in assorted locations, but this one demonstrates the problem, and I expect its solution will be the solution for all the variants. I am getting errors saying "aFunction is not defined".
Why can't the callback function see the globally defined function "aFunction()"?
function aFunction(parm)
{
    return something;
}

(async () => {
    let pages = await browser.pages();
    // array of browser titles
    var titles = [];
    // iterate pages extracting each title using a for loop because forEach can not contain await.
    for (let index = 0; index < pages.length; index++) {
        const pagex = pages[index]
        const title = await pagex.title();
        titles.push(title);
    }
    // chopped and edited a bunch to keep it simple
    // here is the home of my problem.
    foundAt = 0;
    const container = await pages[foundAt].evaluate(() => {
        let elements = $('.classiwant').toArray();
        // this is the failing call
        var x = aFunction(something);
        for (i = 0; i < elements.length; i++) {
            $(elements[i]).click();
        }
    })
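The same two-context rule from the PhantomJS answer above applies here: the function passed to page.evaluate() is serialized and executed inside the page, so it cannot see functions or variables defined in the Node module. A hedged sketch of two common workarounds, reusing the question's own placeholder names aFunction and something, and assuming it sits inside the same async IIFE:

// Option 1: run aFunction in Node and pass only its (serializable) result in.
const precomputed = aFunction(something);          // executes in the Node module
await pages[foundAt].evaluate((value) => {
    // value arrived here via serialization and can be used inside the page
    console.log("got", value);
}, precomputed);

// Option 2: expose the Node function to the page; it becomes async there.
await pages[foundAt].exposeFunction('aFunction', aFunction);
await pages[foundAt].evaluate(async () => {
    const x = await window.aFunction('something'); // proxied back to Node
    console.log(x);
});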

Nodejs/Puppeteer - How to use page.evaluate

I know it is a noob question, but I want to know when I should use page.evaluate.
I also know the documentation exists, but I still do not understand it.
Can anybody give me an explanation of how and when to use this function when creating a scraper with Puppeteer?
First, it is important to understand that there are two main environments:
Node.js (Puppeteer) Environment
Page DOM Environment
You should use page.evaluate() when you want to interact with the page directly in the page DOM environment: you pass it a function, and it returns a Promise<Serializable> which resolves to the return value of that function.
Otherwise, if you do not use page.evaluate(), you will be dealing with elements as ElementHandle objects in the Node.js (Puppeteer) environment.
Example Usage:
const example = await page.evaluate(() => {
    const elements = document.getElementsByClassName('example');
    const result = [];
    document.title = 'New Title';
    for (let i = 0; i < elements.length; i++) {
        result.push(elements[i].textContent);
    }
    return JSON.stringify(result);
});
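For contrast, here is a minimal sketch of staying in the Node.js (Puppeteer) environment with ElementHandles; the example class name is carried over from the snippet above:
// page.$$ returns ElementHandle objects in the Node.js environment, not DOM
// nodes, so reading their text still crosses into the page via evaluate.
const handles = await page.$$('.example');
const texts = [];
for (const handle of handles) {
    texts.push(await page.evaluate(el => el.textContent, handle));
    await handle.dispose(); // release the handle when done with it
}
console.log(texts);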

Callback on cheerio node.js

I'm trying to write a scraper using 'request' and 'cheerio'. I have an array of 100 URLs. I'm looping over the array, calling 'request' on each URL and then doing cheerio.load(body). The scraper works when I limit the loop for testing (i < 3), but if I let it run over more of the array it breaks because var productNumber is undefined and I can't call split on an undefined variable. I think the for loop is moving on before the webpage responds and has time to load the body with cheerio, and this question: nodeJS - Using a callback function with Cheerio would seem to agree.
My problem is that I don't understand how I can make sure the webpage has 'loaded' or been parsed in each iteration of the loop, so that I don't get any undefined variables. According to the other answer I don't need a callback, but then how do I do it?
for (var i = 0; i < productLinks.length; i++) {
    productUrl = productLinks[i];
    request(productUrl, function(err, resp, body) {
        if (err)
            throw err;
        $ = cheerio.load(body);
        var imageUrl = $("#bigImage").attr('src'),
            productNumber = $("#product").attr('class').split(/\s+/)[3].split("_")[1];
        console.log(productNumber);
    });
}
Example of output:
1461536
1499543
TypeError: Cannot call method 'split' of undefined
Since you're not creating a new $ variable for each iteration, it's being overwritten when a request is completed. This can lead to undefined behaviour, where one iteration of the loop is using $ just as it's being overwritten by another iteration.
So try creating a new variable:
var $ = cheerio.load(body);
^^^ this is the important part
Also, you are correct in assuming that the loop continues before the request is completed (in your situation, it isn't cheerio.load that is asynchronous, but request is). That's how asynchronous I/O works.
To coordinate asynchronous operations you can use, for instance, the async module; in this case, async.eachSeries might be useful.
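A minimal sketch of that approach, assuming the same productLinks array and request/cheerio setup as the question (the selector is carried over from the snippet above):
var async = require('async');
var request = require('request');
var cheerio = require('cheerio');

// Process one URL at a time: the next request only starts once the
// callback for the current one has called done().
async.eachSeries(productLinks, function (productUrl, done) {
    request(productUrl, function (err, resp, body) {
        if (err) return done(err);
        var $ = cheerio.load(body);          // a fresh, local $ per response
        var imageUrl = $("#bigImage").attr('src');
        console.log(productUrl, imageUrl);
        done();
    });
}, function (err) {
    if (err) return console.log("A request failed:", err);
    console.log("All product pages processed");
});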
You are scraping some external site(s). You can't be sure the HTML all fits exactly the same structure, so you need to be defensive on how you traverse it.
var product = $('#product');
// an empty cheerio selection is still truthy, so check its length
if (!product.length) return console.log('Cannot find a product element');
var productClass = product.attr('class');
if (!productClass) return console.log('Product element does not have a class defined');
var productNumber = productClass.split(/\s+/)[3].split("_")[1];
console.log(productNumber);
This'll help you debug where things are going wrong, and perhaps indicate that you can't scrape your dataset as easily as you'd hoped.

Chrome extension only opens last assigned url

I have a chrome extension browser action that I want to have list a series of links, and open any selected link in the current tab. So far what I have is this, using jquery:
var url = urlForThisLink;
var li = $('<li/>');
var ahref = $('<a href="#">' + title + '</a>');
ahref.click(function() {
    chrome.tabs.getSelected(null, function (tab) {
        chrome.tabs.update(tab.id, {url: url});
    });
});
li.append(ahref);
It partially works. It does navigate the current tab, but will only navigate to whichever link was last created in this manner. How can I do this for an iterated series of links?
#jmort253's answer is actually a good illustration of what is probably your error. Despite being declared inside the for loop, url has function scope since it is declared with var. So your click handler closure is bound to a variable scoped outside the for loop, and every instance of the closure uses the same value, i.e. the last one.
Once Chrome supports the let keyword, you will be able to use it instead of var and this will work fine, since url will be scoped to the body of the for loop. In the meantime you'll have to create a new scope by creating your closure in a function:
function makeClickHandler(url) {
return function() { ... };
}
Inside the for loop say:
for (var i = 0; i < urls.length; i++) {
    var url = urls[i];
    ...
    ahref.click(makeClickHandler(url));
    ...
}
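Filled in with the chrome.tabs calls from the question, makeClickHandler might look like this (a sketch; the body simply mirrors the original click handler):
function makeClickHandler(url) {
    // The returned function closes over this call's url parameter,
    // so each link keeps the value it was created with.
    return function() {
        chrome.tabs.getSelected(null, function (tab) {
            chrome.tabs.update(tab.id, {url: url});
        });
    };
}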
In your code example, it looks like you only have a single link. Instead, let's assume you have an actual collection of links. In that case, you can use a for loop to iterate through them:
// collection of urls
var urls = ["http://example.com", "http://domain.org"];

// loop through the collection; for each url, build a separate link.
for (var i = 0; i < urls.length; i++) {
    // this is the link for iteration i
    var url = urls[i];
    var li = $('<li/>');
    var ahref = $('<a href="#">' + title + '</a>');
    ahref.click( (function(pUrl) {
        return function() {
            chrome.tabs.getSelected(null, function (tab) {
                chrome.tabs.update(tab.id, {url: pUrl});
            });
        };
    })(url));
    li.append(ahref);
}
I totally forgot about scope when writing the original answer, so I updated it to use a closure based on Matthew Gertner's answer. Basically, in the click event handler, I'm now passing the url variable into an anonymous one-argument function which returns another function. The returned function uses the argument passed into the anonymous function, so its state is unaffected when later iterations of the for loop change the value of url.

Crawling with Node.js

Complete Node.js noob, so don't judge me...
I have a simple requirement. Crawl a web site, find all the product pages, and save some data from the product pages.
Simpler said than done.
Looking at Node.js samples, I can't find anything similar.
There's a request scraper:
request({uri: 'http://www.google.com'}, function (error, response, body) {
    if (!error && response.statusCode == 200) {
        var window = jsdom.jsdom(body).createWindow();
        jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
            // jQuery is now loaded on the jsdom window created from 'body'
            jQuery('.someClass').each(function () { /* Your custom logic */ });
        });
    }
});
But I can't figure out how to make it call itself once it scrapes the root page, or how to populate an array of URLs that it needs to scrape.
Then there's the http agent way:
var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);

agent.addListener('next', function (err, agent) {
    var window = jsdom.jsdom(agent.body).createWindow();
    jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
        // jQuery is now loaded on the jsdom window created from 'agent.body'
        jquery('.someClass').each(function () { /* Your Custom Logic */ });
        agent.next();
    });
});

agent.addListener('stop', function (agent) {
    sys.puts('the agent has stopped');
});

agent.start();
Which takes an array of locations, but then again, once you get it started with an array, you can't add more locations to it to go through all the product pages.
And I can't even get Apricot working; for some reason I'm getting an error.
So, how do I modify any of the above examples (or anything not listed above) to scrape a site, find all the product pages, find some data in there (the jQuery '.someClass' example should do the trick), and save that to a db?
Thanks!
Personally, I use node.io to scrape some websites: https://github.com/chriso/node.io
More details about scraping can be found in the wiki!
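Independent of any particular library, the "populate an array of URLs" part of the question usually comes down to keeping a simple queue and scraping it one page at a time. A minimal sketch using request and cheerio (cheerio stands in for jsdom here purely for brevity; the start URL and selectors are placeholders):
var request = require('request');
var cheerio = require('cheerio');

var queue = ['http://www.example.com/'];  // seed with the root page
var seen = {};

function crawlNext() {
    var url = queue.shift();
    if (!url) return console.log('Done crawling');
    if (seen[url]) return crawlNext();
    seen[url] = true;

    request(url, function (err, resp, body) {
        if (!err && resp.statusCode == 200) {
            var $ = cheerio.load(body);
            // push newly discovered links back onto the queue
            $('a[href^="http"]').each(function () {
                queue.push($(this).attr('href'));
            });
            // product-page scraping would go here, e.g. $('.someClass')...
        }
        crawlNext();
    });
}

crawlNext();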
I've had pretty good success crawling and scraping with CasperJS. It's a pretty nice library built on top of PhantomJS. I like it because it's fairly succinct: callbacks can be executed as foo.then(), which is super-simple to understand, and I can even use jQuery since PhantomJS is built on WebKit. For example, the following would instantiate an instance of Casper and push all links on an archive page to an array called 'links'.
var casper = require("casper").create();
var numberOfLinks = 0;
var currentLink = 0;
var links = [];
var buildPage, capture, selectLink, grabContent, writeContent;

casper.start("http://www.yoursitehere.com/page_to/scrape/", function() {
    numberOfLinks = this.evaluate(function() {
        return __utils__.findAll('.nav-selector a').length;
    });
    this.echo(numberOfLinks + " items found");
    // cause jquery makes it easier
    casper.page.injectJs('/PATH/TO/jquery.js');
});

// Capture links
capture = function() {
    links = this.evaluate(function() {
        var link = [];
        jQuery('.nav-selector a').each(function() {
            link.push($(this).attr('href'));
        });
        return link;
    });
    this.then(selectLink);
};
You can then use node fs (or whatever else you want, really) to push your data into XML, CSV, or whatever you want. The example for scraping BBC photos was exceptionally helpful when I built my scraper.
This is a view from 10,000 feet of what casper can do. It has a very potent and broad API. I dig it, in case you couldn't tell :).
My full scraping example is here: https://gist.github.com/imjared/5201405.
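As a small follow-up to the fs suggestion above, a minimal sketch of dumping the collected links array to a CSV-style file at the end of the run (note that inside a CasperJS script, require('fs') gives you PhantomJS's fs module rather than Node's; the file name is arbitrary):
var fs = require('fs');  // PhantomJS's fs module inside a CasperJS script

casper.run(function() {
    // one link per line; swap in whatever delimiter/format you need
    fs.write('links.csv', links.join('\n'), 'w');
    this.echo(links.length + ' links written to links.csv').exit();
});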
