Scraping dynamic data from a web page in Node.js

I am trying to scrape a web page using Node.js. For this, I am using the cheerio and tinyreq modules. My source code is as follows:
// required modules: cheerio for parsing, tinyreq for fetching
const cheerio = require("cheerio");
const req = require("tinyreq");

// scrape function
function scrape(url, data, cb) {
    req(url, (err, body) => {
        if (err) { return cb(err); }
        let $ = cheerio.load(body)
          , pageData = {};
        Object.keys(data).forEach(k => {
            pageData[k] = $(data[k]).text();
        });
        cb(null, pageData);
    });
}

scrape("https://www.activecubs.com/activity-wheel/", {
    title: ".row h1"
    , description: ".row h2"
}, (err, data) => {
    console.log(err || data);
});
In my code, the text in the h1 tag is static, while the text in the h2 tag is dynamic (it is the data produced by spinning the activity wheel). When I run the code, I only get the static data, i.e., the description field is empty. Following previous Stack Overflow questions, I tried using PhantomJS to overcome this issue, but it didn't work for me. For any doubts about the website I am using, see https://www.activecubs.com/activity-wheel/.

The Cheerio documentation is pretty clear: Cheerio is not a web browser and does not execute the page's JavaScript, so dynamically rendered content never reaches it.
https://github.com/cheeriojs/cheerio#cheerio-is-not-a-web-browser
See also Nightmare, which drives a real (headless) browser: https://github.com/segmentio/nightmare
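As a rough illustration of that approach, here is a minimal sketch that loads the page in Nightmare (so the page's JavaScript actually runs), waits for the dynamic element, and then hands the rendered HTML to Cheerio. The selectors and the wait condition are the ones from the question and may need adjusting:
const Nightmare = require('nightmare');
const cheerio = require('cheerio');

const nightmare = Nightmare({ show: false });

nightmare
    .goto('https://www.activecubs.com/activity-wheel/')
    .wait('.row h2')                          // wait until the dynamic element has been rendered
    .evaluate(() => document.body.innerHTML)  // grab the rendered HTML
    .end()
    .then((body) => {
        const $ = cheerio.load(body);
        console.log({
            title: $('.row h1').text(),
            description: $('.row h2').text()
        });
    })
    .catch((err) => console.error(err));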

User actions can be performed using SpookyJS.
SpookyJS makes it possible to drive CasperJS suites from Node.js. At a high level, Spooky accomplishes this by spawning Casper as a child process and controlling it via RPC.
Specifically, each Spooky instance spawns a child Casper process that runs a bootstrap script. The bootstrap script sets up a JSON-RPC server that listens for commands from the parent Spooky instance over a transport (either HTTP or stdio). The script also sets up a JSON-RPC client that sends events to the parent Spooky instance via stdout. Check the documentation
Example
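A minimal sketch, adapted from the hello-world pattern in the SpookyJS README; the URL is the one from the question, the 'description' event name is just illustrative, and CasperJS/PhantomJS must be installed for this to run:
var Spooky = require('spooky');

var spooky = new Spooky({
    child: { transport: 'http' },
    casper: { logLevel: 'debug', verbose: true }
}, function (err) {
    if (err) { throw err; }

    spooky.start('https://www.activecubs.com/activity-wheel/');
    spooky.then(function () {
        // this block runs inside the CasperJS/PhantomJS context,
        // so the page's JavaScript has already been executed
        this.emit('description', this.fetchText('.row h2'));
    });
    spooky.run();
});

spooky.on('description', function (text) {
    console.log(text);
});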

Related

Abstracting the superagent

Our application consists of Node.js, Express, React, and newforms.
To make REST calls we are using:
var RestClient = require('superagent-ls')
And we are making REST calls like:
cleanBirthDate(callback) {
    var {birthDate} = this.cleanedData
    var formattedDob = moment(birthDate).format('DDMMYYYY')
    RestClient.get(Global.getBirthDateServiceUrl() + '/' + formattedDob)
        .end((err, res) => {
            if (err) {
                callback(err)
            } else if (res.clientError) {
                var message = errorsMappingSwitch(res.body.error)
                callback(null, forms.ValidationError(message))
            } else {
                callback(null)
            }
        })
},
We want to move the RestClient-related code into its own file, say RestClient.js, and then require it and use it across the application. By doing so we can apply some generalised code (like error handling, logging, and redirecting to specific error pages depending on the error code) in one place.
Appreciate any help in this direction.
I did the exact same thing you require (even with superagent). I created modules with the API code in a /utils folder and required them where applicable. For even more abstraction we're using CoffeeScript to create classes that inherit from a BaseAPIObject and invoke using something like API.Posts.getAll().end() etc.
This article was very helpful in understanding how to write your own modules: Export This: Interface Design Patterns for Node.js Modules.
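A rough sketch of that layout (the file path, helper name, and error handling are illustrative, not the API of any particular library):
// utils/RestClient.js -- hypothetical wrapper around superagent-ls
var request = require('superagent-ls')

// Centralised error handling and logging live here instead of in every caller.
function get(url, callback) {
    request.get(url).end(function (err, res) {
        if (err) {
            console.error('REST call failed:', url, err)
            return callback(err)
        }
        if (res.clientError) {
            // map service error codes to user-facing messages in one place
            return callback(null, { error: res.body.error })
        }
        callback(null, res.body)
    })
}

module.exports = { get: get }
Callers then just require('./utils/RestClient') and call RestClient.get(...), so the error handling and logging never have to be repeated.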
You can always require it like this:
RestClient.js
export default function callApi(callback) {
//your rest code
// use the callback here in the callback of your call.
}
app.js
import callApi from './RestClient';
callApi((err, result) => {
if (err) console.log(err)
});

Nodejs + mikeal/Request module, how to close request or increase MaxSockets

I have a Nodejs app that's designed to perform simple end-to-end testing of a large web application. This app uses the mikeal/Request and Cheerio modules to navigate, request, traverse and inspect web pages across the application.
We are refactoring some tests, and are hitting a problem when multiple request functions are called in series. I believe this may be due to the Node.js process hitting the MaxSockets limit, but am not entirely sure.
Some code...
var request = require('request');
var cheerio = require('cheerio');
var async = require('async');

var getPages_FromMenuLinks = function() {
    var pageUrl = 'http://www.example.com/app';
    async.waterfall([
        function topPageRequest(cb1) {
            var menuLinks = [];
            request(pageUrl, function(err, resp, page) {
                var $ = cheerio.load(page);
                $('div[class*="sub-menu"]').each(function (i, elem) {
                    menuLinks.push($(this).find('a').attr('href'));
                });
                cb1(null, menuLinks);
            });
        }, function subMenuRequests(menuLinks, cb2) {
            async.eachSeries(menuLinks, function(link, callback) {
                request(link, function(err, resp, page) {
                    var $ = cheerio.load(page);
                    // do some quick validation testing of elements on the expected page
                    callback();
                });
            }, function() { cb2(null); });
        }
    ], function () { });
};
module.exports = getPages_FromMenuLinks;
Now, if I run this Node script, it runs through the first topPageRequest and starts the subMenuRequests, but then freezes after completing the request for the third sub-menu item.
It seems that I might be hitting a Max-Sockets limit, either in Node or on my machine (?) -- I'm testing this on a standard Windows 8 machine, running Node v0.10.26.
I've tried using request({pool:{maxSockets:25}, url:link}, function(err, resp..., but it does not seem to make any difference.
It also seems there's a way to abort the request object, if I first instantiate it (as found here). But I have no idea how I would "parse" the page, similar to what's happening in the above code. In other words, from the solution found in the link...
var theRequest = request({ ... });
theRequest.pipe(parser);
theRequest.abort();
..., how would I re-write my code to pipe and "parse" the request?
You can easily make thousands of requests at the same time (e.g. from a single for loop) and they will be queued and terminated automatically one by one, once a particular request is served.
I think by default there are 5 sockets per domain, and in your case this limit should be more than enough.
It is highly probable that your server does not handle your requests properly (e.g. on error they are not terminated and hang indefinitely).
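(If you do decide to raise the limit anyway, the Node 0.10-era default agent can be adjusted globally; the value 25 here is arbitrary:)
var http = require('http');
// default maxSockets per host is 5 in Node 0.10.x
http.globalAgent.maxSockets = 25;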
There are three steps you can take to find out what is going on:
check that you are sending proper requests -- as #mattyice observed, there were some small bugs in the code as originally posted.
investigate the server code and the way your requests are handled there -- to me it seems that the server does not serve/terminate them in the first place.
use a timeout when sending the request. 5000 ms should be a reasonable amount of time to wait; on timeout the request will be aborted with an appropriate error code (see the sketch after this list).
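A minimal sketch of that last step, using the timeout option of mikeal/request (link and callback are the variables from the code in the question):
request({ url: link, timeout: 5000 }, function (err, resp, page) {
    if (err) {
        // err.code is 'ETIMEDOUT' or 'ESOCKETTIMEDOUT' when the timeout fires
        return callback(err);
    }
    // ... run the same cheerio checks as before
    callback();
});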
One piece of advice: I would recommend using more suitable, easier-to-use, and more accurate tools for your testing, e.g. PhantomJS.

How do I make HTTP requests inside a loop in NodeJS

I'm writing a command-line script in Node (because I know JS and suck at Bash, plus I need jQuery for navigating through the DOM)… right now I'm reading an input file and iterating over each line.
How do I go about making one HTTP request (GET) per line so that I can load the resulting string with jQuery and extract the information I need from each page?
I've tried using the NPM httpsync package… so I could make one blocking GET call per line of my input file but it doesn't support HTTPS and of course the service I'm hitting only supports HTTPS.
Thanks!
A good way to handle a large number of jobs in a controlled manner is the async queue.
I also recommend you look at request for making HTTP requests and cheerio for dealing with the HTML you get.
Putting these together, you get something like:
var async = require('async');
var request = require('request');
var cheerio = require('cheerio');

var q = async.queue(function (task, done) {
    request(task.url, function(err, res, body) {
        if (err) return done(err);
        if (res.statusCode != 200) return done(res.statusCode);
        var $ = cheerio.load(body);
        // ...
        done();
    });
}, 5);
Then add all your URLs to the queue:
q.push({ url: 'https://www.example.com/some/url' });
// ...
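If you also want to know when the whole queue has been worked off, the queue exposes a drain callback (assigned as a property in the classic callback-style async API; newer async versions use q.drain(fn) instead):
q.drain = function () {
    console.log('all URLs have been processed');
};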
I would most likely use the async library's eachLimit function. That lets you throttle the number of active connections, and you also get a callback when all the operations are done.
async.eachLimit(urls, function(url, done) {
    request(url, function(err, res, body) {
        // do something
        done();
    });
}, 5, function(err) {
    // do something
    console.log('all done!');
});
I was worried about making a million simultaneous requests without some kind of throttling/limiting of the number of concurrent connections, but it seems like Node throttles me "out of the box" to around 5-6 concurrent connections.
This is perfect, as it lets me keep my code a lot simpler while also fully leveraging the inherent asynchrony of Node.

Meteor client synchronous server database calls

I am building an application in Meteor that relies on real-time updates from the database. The way Meteor lays out the examples is to have the database call under the Template call. I've found that when dealing with medium-sized datasets this becomes impractical. I am trying to move the request to the server and have the results passed back to the client.
I have looked at similar questions on Stack Overflow but have found no immediate answers.
Here is my server side function:
Meteor.methods({
    "getTest" : function() {
        var res = Data.find({}, { sort : { time : -1 }, limit : 10 });
        var r = res.fetch();
        return (r);
    }
});
And client side:
Template.matches._matches = function() {
    var res = {};
    Meteor.call("getTest", function (error, result) {
        res = result;
    });
    return res;
}
I have tried variations of the above code - returning in the callback function as one example. As far as I can tell, having a callback makes the function asynchronous, so it cannot be called onload (synchronously) and has to be invoked from the client.
I would like to pass all database queries server side to lighten the front end load. Is this possible in Meteor?
Thanks
The way to do this is to use subscriptions instead of remote method calls. See the counts-by-room example in the docs. So, for every database query you have a collection that exists client-side only. The server then decides which records end up in that collection, using set and unset.
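A rough sketch of that pattern with the collection from the question, shown in the simpler cursor-returning form rather than manual set/unset; the publication name 'latestData' is just illustrative and the exact publish API depends on your Meteor version:
// server
Meteor.publish('latestData', function () {
    // the server decides which documents the client ever sees
    return Data.find({}, { sort: { time: -1 }, limit: 10 });
});

// client
Meteor.subscribe('latestData');

Template.matches._matches = function () {
    // the client-side collection mirrors the published records,
    // and this helper re-runs reactively whenever they change
    return Data.find({}, { sort: { time: -1 } }).fetch();
};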

Crawling with Node.js

Complete Node.js noob, so don't judge me...
I have a simple requirement. Crawl a web site, find all the product pages, and save some data from the product pages.
Easier said than done.
Looking at Node.js samples, I can't find anything similar.
There's a request-based scraper:
var request = require('request');
var jsdom = require('jsdom');

request({uri:'http://www.google.com'}, function (error, response, body) {
    if (!error && response.statusCode == 200) {
        var window = jsdom.jsdom(body).createWindow();
        jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
            // jQuery is now loaded on the jsdom window created from 'body'
            jquery('.someClass').each(function () { /* Your custom logic */ });
        });
    }
});
But I can't figure out how to make it call itself once it scrapes the root page, or how to populate an array of URLs that it needs to scrape.
Then there's the http agent way:
var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);

agent.addListener('next', function (err, agent) {
    var window = jsdom.jsdom(agent.body).createWindow();
    jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
        // jQuery is now loaded on the jsdom window created from 'agent.body'
        jquery('.someClass').each(function () { /* Your Custom Logic */ });
        agent.next();
    });
});

agent.addListener('stop', function (agent) {
    sys.puts('the agent has stopped');
});

agent.start();
That takes an array of locations, but then again, once you get it started with an array, you can't add more locations to it to go through all the product pages.
And I can't even get Apricot working; for some reason I'm getting an error.
So, how do I modify any of the above examples (or anything not listed above) to scrape a site, find all the product pages, find some data in there (the '.someClass' jQuery example should do the trick), and save that to a DB?
Thanks!
Personally, I use node.io to scrape some websites: https://github.com/chriso/node.io
More details about scraping can be found in its wiki!
I've had pretty good success crawling and scraping with CasperJS. It's a pretty nice library built on top of PhantomJS. I like it because it's fairly succinct. Callbacks can be executed as foo.then(), which is super-simple to understand, and I can even use jQuery since PhantomJS is built on WebKit. For example, the following would instantiate an instance of Casper and push all links on an archive page to an array called 'links'.
var casper = require("casper").create();
var numberOfLinks = 0;
var currentLink = 0;
var links = [];
var buildPage, capture, selectLink, grabContent, writeContent;

casper.start("http://www.yoursitehere.com/page_to/scrape/", function() {
    numberOfLinks = this.evaluate(function() {
        return __utils__.findAll('.nav-selector a').length;
    });
    this.echo(numberOfLinks + " items found");
    // cause jquery makes it easier
    casper.page.injectJs('/PATH/TO/jquery.js');
});

// Capture links
capture = function() {
    links = this.evaluate(function() {
        var link = [];
        jQuery('.nav-selector a').each(function() {
            link.push($(this).attr('href'));
        });
        return link;
    });
    this.then(selectLink);
};
You can then use node fs (or whatever else you want, really) to push your data into XML, CSV, or whatever you want. The example for scraping BBC photos was exceptionally helpful when I built my scraper.
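For instance, a hypothetical writeContent step could dump the collected links with the fs module that ships with PhantomJS (fs.write(path, content, mode) is the PhantomJS flavour; in plain Node you would use fs.writeFileSync instead):
var fs = require('fs');

writeContent = function() {
    // serialise the links array gathered by capture() to disk
    fs.write('links.json', JSON.stringify(links), 'w');
};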
This is a 10,000-foot view of what Casper can do. It has a very potent and broad API. I dig it, in case you couldn't tell :).
My full scraping example is here: https://gist.github.com/imjared/5201405.
