Crawling with Node.js

Complete Node.js noob, so don't judge me...
I have a simple requirement: crawl a web site, find all the product pages, and save some data from the product pages.
Easier said than done.
Looking at Node.js samples, I can't find anything similar.
There's a request scraper:
var request = require('request'),
    jsdom = require('jsdom');

request({uri: 'http://www.google.com'}, function (error, response, body) {
    if (!error && response.statusCode == 200) {
        var window = jsdom.jsdom(body).createWindow();
        jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jQuery) {
            // jQuery is now loaded on the jsdom window created from 'body'
            jQuery('.someClass').each(function () { /* Your custom logic */ });
        });
    }
});
But I can't figure out how to make it call itself once it scrapes the root page, or how to populate an array of URLs that it needs to scrape.
Then there's the http agent way:
var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);

agent.addListener('next', function (err, agent) {
    var window = jsdom.jsdom(agent.body).createWindow();
    jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
        // jQuery is now loaded on the jsdom window created from 'agent.body'
        jquery('.someClass').each(function () { /* Your Custom Logic */ });
        agent.next();
    });
});

agent.addListener('stop', function (agent) {
    sys.puts('the agent has stopped');
});

agent.start();
Which takes an array of locations, but then again, once you get it started with an array, you can't add more locations to it to go through all the product pages.
And I can't even get Apricot working; for some reason I'm getting an error.
So, how do I modify any of the above examples (or anything not listed above) to scrape a site, find all the product pages, find some data in there (the jQuery '.someClass' example should do the trick), and save that to a DB?
Thanks!

Personally, I use Node IO to scrape some websites. https://github.com/chriso/node.io
More details about scraping can be found in the wiki!

I've had pretty good success crawling and scraping with CasperJS. It's a pretty nice library built on top of PhantomJS. I like it because it's fairly succinct. Callbacks can be executed as foo.then(), which is super simple to understand, and I can even use jQuery, since PhantomJS is built on WebKit. For example, the following would instantiate an instance of Casper and push all links on an archive page to an array called 'links'.
var casper = require("casper").create();
var numberOfLinks = 0;
var currentLink = 0;
var links = [];
var buildPage, capture, selectLink, grabContent, writeContent;

casper.start("http://www.yoursitehere.com/page_to/scrape/", function() {
    numberOfLinks = this.evaluate(function() {
        return __utils__.findAll('.nav-selector a').length;
    });
    this.echo(numberOfLinks + " items found");
    // cause jquery makes it easier
    casper.page.injectJs('/PATH/TO/jquery.js');
});
// Capture links
capture = function() {
    links = this.evaluate(function() {
        var link = [];
        jQuery('.nav-selector a').each(function() {
            link.push($(this).attr('href'));
        });
        return link;
    });
    this.then(selectLink);
};
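To give a rough idea of how the chain might continue, here is one possible shape for selectLink and grabContent (they are only declared above, so this is a sketch; the '.product-title' selector is a placeholder):
// Walk the captured links one at a time
selectLink = function() {
    if (currentLink < numberOfLinks) {
        this.thenOpen(links[currentLink], grabContent);
    }
};

// Pull something out of the opened page, then queue up the next link
grabContent = function() {
    var content = this.evaluate(function() {
        return jQuery('.product-title').text(); // placeholder selector
    });
    this.echo(content);
    currentLink++;
    this.then(selectLink);
};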
You can then use Node's fs module (or whatever else you want, really) to push your data into XML, CSV, or any other format. The example for scraping BBC photos was exceptionally helpful when I built my scraper.
This is a 10,000-foot view of what Casper can do. It has a very potent and broad API. I dig it, in case you couldn't tell :).
My full scraping example is here: https://gist.github.com/imjared/5201405.

Related

Scraping dynamic data of a web page in nodejs

Using Node.js, I am trying to scrape a web page. For this, I am using the cheerio and tinyreq modules. My source code is as follows:
var cheerio = require("cheerio");
var req = require("tinyreq");

// scrape function
function scrape(url, data, cb) {
    req(url, (err, body) => {
        if (err) { return cb(err); }
        let $ = cheerio.load(body)
          , pageData = {};
        Object.keys(data).forEach(k => {
            pageData[k] = $(data[k]).text();
        });
        cb(null, pageData);
    });
}

scrape("https://www.activecubs.com/activity-wheel/", {
    title: ".row h1"
  , description: ".row h2"
}, (err, data) => {
    console.log(err || data);
});
In my code, the text in the h1 tag is static, and in the h2 tag it is dynamic. When I run the code, I only get the static data, i.e. the description field is empty. Following previous Stack Overflow questions, I tried using PhantomJS to overcome this issue, but it doesn't work for me. The dynamic data here is the data obtained by rotating a wheel. For any doubts about the website I am using, you can check https://www.activecubs.com/activity-wheel/.
The Cheerio documentation is pretty clear about this:
https://github.com/cheeriojs/cheerio#cheerio-is-not-a-web-browser
See also https://github.com/segmentio/nightmare
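For a rough illustration of the Nightmare route (a minimal sketch, not a drop-in solution: the URL and the '.row h2' selector come from the question, everything else is assumed):
var Nightmare = require('nightmare');
var nightmare = Nightmare({ show: false });

nightmare
    .goto('https://www.activecubs.com/activity-wheel/')
    .wait('.row h2') // wait until the dynamic element has been rendered
    .evaluate(function () {
        return document.querySelector('.row h2').innerText;
    })
    .end()
    .then(function (description) {
        console.log(description); // text produced by the page's own JavaScript
    })
    .catch(function (error) {
        console.error(error);
    });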
User actions can be performed using SpookyJS.
SpookyJS makes it possible to drive CasperJS suites from Node.js. At a high level, Spooky accomplishes this by spawning Casper as a child process and controlling it via RPC.
Specifically, each Spooky instance spawns a child Casper process that runs a bootstrap script. The bootstrap script sets up a JSON-RPC server that listens for commands from the parent Spooky instance over a transport (either HTTP or stdio). The script also sets up a JSON-RPC client that sends events to the parent Spooky instance via stdout. Check the documentation
Example
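For a taste of what driving Casper from Node.js through Spooky looks like, here is a sketch modeled on the shape of the SpookyJS README (the URL and the '.row h2' selector are taken from the question above; the options are just common defaults):
var Spooky = require('spooky');

var spooky = new Spooky({
    child: { transport: 'http' },
    casper: { logLevel: 'debug', verbose: true }
}, function (err) {
    if (err) { throw err; }

    spooky.start('https://www.activecubs.com/activity-wheel/');
    spooky.then(function () {
        // this runs inside the Casper/PhantomJS process, after the page's JS has executed
        this.emit('heading', this.fetchText('.row h2'));
    });
    spooky.run();
});

// events emitted in the Casper context are forwarded back to Node
spooky.on('heading', function (heading) {
    console.log(heading);
});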

NodeJS readdir() function always being run twice

I've been trying to pick up Node.js and learning more for backend development purposes. I can't seem to wrap my mind around async tasks, though, and I have an example here that I've spent hours on, trying to search for the solution.
app.get('/initialize_all_pictures', function(req, res){
    var path = './images/';
    fs.readdir(path, function(err, items){
        if (err){
            console.log("there was an error");
            return;
        }
        console.log(items.length);
        for(var i = 0; i < items.length; i++){
            var photo = new Photo(path + items[i], 0, 0, Math.floor(Math.random() * 1000));
            photoArray.push(photo);
        }
    });
    res.json({"Success" : "Done"});
});
Currently, I have this endpoint that is supposed to look through a directory called images, create "Photo" objects, and push them into a global array called photoArray. It works, except that the readdir callback is always being called twice.
console.log would always give output of
2
2
(I have two items in the directory).
Why is this?
Just figured out the problem.
I had a Chrome extension that would help me format JSON values from HTTP requests. Unfortunately, the extension actually made an additional call to the endpoint, so whenever I pointed my browser at the endpoint, the handler ended up getting called twice!
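Separately from the double call: as written, res.json fires before readdir has finished, so the response is sent before photoArray is populated. If the response should reflect the directory read, one possible rearrangement (a sketch only, reusing the same names as the question) is to respond from inside the callback:
app.get('/initialize_all_pictures', function(req, res){
    var path = './images/';
    fs.readdir(path, function(err, items){
        if (err){
            // report the failure instead of silently returning
            return res.status(500).json({"Error": "could not read directory"});
        }
        for(var i = 0; i < items.length; i++){
            photoArray.push(new Photo(path + items[i], 0, 0, Math.floor(Math.random() * 1000)));
        }
        // respond only after the directory listing has been processed
        res.json({"Success": "Done"});
    });
});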

Getting data pushed to an array outside of a Promise

I'm using https://github.com/Haidy777/node-youtubeAPI-simplifier to grab some information from a playlist of Bounty Killers. The way this library is set up, it seems to use Promises via Bluebird (https://github.com/petkaantonov/bluebird), which I don't know much about. Looking up the beginner's guide for Bluebird gives http://bluebirdjs.com/docs/beginners-guide.html, which literally just shows:
This article is partially or completely unfinished. You are welcome to create pull requests to help completing this article.
I am able to set up the library
var ytapi = require('node-youtubeapi-simplifier');
ytapi.setup('My Server Key');
As well as list some information about Bounty Killers
ytdata = [];

ytapi.playlistFunctions.getVideosForPlaylist('PLCCB0BFBF2BB4AB1D')
    .then(function (data) {
        for (var i = 0, len = data.length; i < len; i++) {
            ytapi.videoFunctions.getDetailsForVideoIds([data[i].videoId])
                .then(function (video) {
                    console.log(video);
                    // ytdata.push(video); <- Push a Bounty Killer Video
                });
        }
    });

// console.log(ytdata); This gives []
Basically, the above pulls the full playlist (normally there will be some pagination here depending on the length), then takes the data from getVideosForPlaylist, iterates over the list, and calls getDetailsForVideoIds for each YouTube video. All good here.
The issue arises with getting data out of this. I would like to push each video object to the ytdata array, and I'm unsure whether the empty array at the end is due to scoping or to things being out of sync, such that console.log(ytdata) gets called before the API calls are finished.
How will I be able to get each Bounty Killer video into the ytdata array to be available globally?
console.log(ytdata) gets called before the API calls are finished
Spot on, that's exactly what's happening here, the API calls are async. Once you're using async functions, you must go the async way if you want to deal with the returned data. Your code could be written like this:
var ytapi = require('node-youtubeapi-simplifier');
ytapi.setup('My Server Key');
// this function returns a promise you can "wait" on
function getVideos() {
    return ytapi.playlistFunctions
        .getVideosForPlaylist('PLCCB0BFBF2BB4AB1D')
        .then(function (videos) {
            // extract all videoIds
            var videoIds = videos.map(video => video.videoId);
            // getDetailsForVideoIds is called with an array of videoIds
            // and returns a promise; one API call is enough
            return ytapi.videoFunctions.getDetailsForVideoIds(videoIds);
        });
}

getVideos().then(function (ydata) {
    // this is the only place ydata is full of data
    console.log(ydata);
});
I made use of ES6's arrow function in videos.map(video => video.videoId); that should work if your Node.js is v4+.
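If you did want one details call per video rather than a single batched call, a hedged variant using Promise.all (assuming getDetailsForVideoIds resolves with the details for the ids it is given, as above) would still keep everything inside the promise chain:
function getVideosOneByOne() {
    return ytapi.playlistFunctions
        .getVideosForPlaylist('PLCCB0BFBF2BB4AB1D')
        .then(function (videos) {
            // one details request per video; Promise.all resolves once all of them have
            return Promise.all(videos.map(function (video) {
                return ytapi.videoFunctions.getDetailsForVideoIds([video.videoId]);
            }));
        });
}

getVideosOneByOne().then(function (details) {
    console.log(details); // an array of per-video results, in playlist order
});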
Your console.log(ytdata) sits immediately AFTER your FOR loop, but that data is NOT available until the promise is resolved and the FOR loop execution is complete; attempting to access it beforehand will give you an empty array.
(Your current console.log is not working because that code is executed immediately, before the promise is resolved.) Only code inside the THEN block is executed AFTER the promise is resolved.
If you NEED the data available NOW or ASAP, and the requests for the videos are taking a long time, then can you request one video at a time, or on demand, or on a separate thread (using a web worker maybe)? Can you implement caching?
Can you make the requests up front behind the scenes before the user even visits this page? (not sure this is a good idea but it is an idea)
Can you use video thumbnails (like youtube does) so that when the thumbnail is clicked then you start streaming and playing the video?
Some ideas ... Hope this helps
ytdata = [];

ytapi.playlistFunctions.getVideosForPlaylist('PLCCB0BFBF2BB4AB1D')
    .then(function (data) {
        // THE CODE INSIDE THIS THEN BLOCK IS EXECUTED WHEN ALL THE VIDEO IDS HAVE BEEN RETRIEVED AND ARE AVAILABLE
        // YOU COULD SAVE THESE TO A DATASTORE IF YOU WANT
        for (var i = 0, len = data.length; i < len; i++) {
            var videoIds = [data[i].videoId];
            ytapi.videoFunctions.getDetailsForVideoIds(videoIds)
                .then(function (video) {
                    // THE CODE INSIDE THIS THEN BLOCK IS EXECUTED WHEN THE DETAILS HAVE BEEN DOWNLOADED FOR THE videoIds PROVIDED
                    // AGAIN YOU CAN DO WHATEVER YOU WANT WITH THESE DETAILS
                    // ALSO NOW THAT THE DATA IS AVAILABLE YOU MIGHT WANT TO HIDE THE LOADING ICON AND RENDER THE PAGE! AGAIN JUST AN IDEA; A DATA STORE WOULD PROVIDE FASTER ACCESS BUT YOU WOULD NEED TO UPDATE THE CACHE EVERY SO OFTEN
                    // ytdata.push(video); <- Push a Bounty Killer Video
                });
            // NOTE: THE FOR LOOP ONLY STARTS EACH REQUEST; THE DETAILS ARRIVE LATER, INSIDE THE THEN BLOCKS ABOVE
        }
        // SO THE DATA IS NOT YET AVAILABLE WHEN THE FOR LOOP HAS COMPLETED, ONLY ONCE EVERY THEN BLOCK HAS RUN
    });

// This is executed immediately, before the YouTube API has responded.
// console.log(ytdata); This gives []

Zombie.js check dynamic updates

I am trying to scrape content from a web page that is continuously changing. I have been able to use PhantomJS to achieve this, but I wanted a lighter-weight solution. The following code gets the correct value the first time it prints to the console; however, on subsequent iterations the same value is printed. Any ideas?
var Browser = require("zombie");
var assert = require("assert");

// Load the page
browser = new Browser();
browser.visit("http://www.timeanddate.com/worldclock/usa/los-angeles", function () {
    setInterval(function () {
        console.log(browser.text('#ct'));
    }, 10000);
});
Note the example above is purely an example. I know this would be the most inefficient way to get the time in Los Angeles.
Once you call browser.visit(), the browser stores the response, but unless you call it multiple times, the response won't change. See it for yourself:
browser.visit("http://www.timeanddate.com/worldclock/usa/los-angeles", function () {
console.log(browser.html()); // will print the HTML to stdout
});
So what you probably want is to call browser.visit() more than once, maybe inside setInterval() (although there may be more robust solutions out there).
I readapted your code:
var Browser = require("zombie");
var assert = require("assert");

var browser = new Browser();

setInterval(function () {
    browser.visit("http://www.timeanddate.com/worldclock/usa/los-angeles", function () {
        console.log(browser.text('#ct'));
    });
}, 10000);

Incremental and non-incremental urls in node js with cheerio and request

I am trying to scrape data from a page using cheerio and request in the following way:
1) go to url 1a (http://example.com/0)
2) extract url 1b (http://example2.com/52)
3) go to url 1b
4) extract some data and save
5) go to url 1a+1 (http://example.com/1, let's call it 2a)
6) extract url 2b (http://example2.com/693)
7) go to url 2b
8) extract some data and save etc...
I am struggling to work out how to do this (note: I am only familiar with Node.js and cheerio/request for this task, even though it is likely not elegant, so I am not looking for alternative libraries or languages to do this in, sorry). I think I am missing something, because I can't even think how this could work.
EDIT
Let me try this another way. Here is the first part of the code:
var request = require('request'),
    cheerio = require('cheerio');

request('http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1&s=0', function(error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html, {
            xmlMode: true
        });
        var id = $('work').attr('id');
        var total = $('records').attr('total');
    }
});
The first returned page looks like this:
<response>
  <query>date:[2000 TO 2014]</query>
  <zone name="book">
    <records s="0" n="1" total="69977" next="/result?l-advformat=Thesis&sortby=dateDesc&q=+date%3A%5B2000+TO+2014%5D&l-availability=y&l-australian=y&n=1&zone=book&s=1">
      <work id="189231549" url="/work/189231549">
        <troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl>
        <title>
          Design of physiological control and magnetic levitation systems for a total artificial heart
        </title>
        <contributor>Greatrex, Nicholas Anthony</contributor>
        <issued>2014</issued>
        <type>Thesis</type>
        <holdingsCount>1</holdingsCount>
        <versionCount>1</versionCount>
        <relevance score="0.001961126">vaguely relevant</relevance>
        <identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier>
      </work>
    </records>
  </zone>
</response>
The URL above needs to be incremented, s=0, s=1, etc., for 'total' number of times.
The 'id' then needs to be fed into the URL below in a second request:
request('http://api.trove.nla.gov.au/work/' + id + '?key=6k6oagt6ott4ohno&reclevel=full', function(error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html, {
            xmlMode: true
        });
        // extract data here etc.
    }
});
For example, when using the id="189231549" returned by the first request, the second returned page looks like this:
<work id="189231549" url="/work/189231549">
  <troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl>
  <title>
    Design of physiological control and magnetic levitation systems for a total artificial heart
  </title>
  <contributor>Greatrex, Nicholas Anthony</contributor>
  <issued>2014</issued>
  <type>Thesis</type>
  <subject>Total Artificial Heart</subject>
  <subject>Magnetic Levitation</subject>
  <subject>Physiological Control</subject>
  <abstract>
    Total Artificial Hearts are mechanical pumps which can be used to replace the failing natural heart. This novel study developed a means of controlling a new design of pump to reproduce physiological flow bringing closer the realisation of a practical artificial heart. Using a mathematical model of the device, an optimisation algorithm was used to determine the best configuration for the magnetic levitation system of the pump. The prototype device was constructed and tested in a mock circulation loop. A physiological controller was designed to replicate the Frank-Starling like balancing behaviour of the natural heart. The device and controller provided sufficient support for a human patient while also demonstrating good response to various physiological conditions and events. This novel work brings the design of a practical artificial heart closer to realisation.
  </abstract>
  <language>English</language>
  <holdingsCount>1</holdingsCount>
  <versionCount>1</versionCount>
  <tagCount>0</tagCount>
  <commentCount>0</commentCount>
  <listCount>0</listCount>
  <identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier>
</work>
So my question now is: how do I tie these two parts (loops) together to achieve the result (download and parse about 70,000 pages)?
I have no idea how to code this in JavaScript for Node.js. I am new to JavaScript.
You can find out how to do it by studying existing well-known website copiers (closed source or open source).
For example, use the trial copy of http://www.tenmax.com/teleport/pro/home.htm to scrape your pages and then try the same with http://www.httrack.com, and you should get a pretty clear idea of how they did it (and how you can do it).
The key programming concepts are a lookup cache and a task queue.
Recursion is not the right concept here if your solution should scale well up to several Node.js worker processes and up to many pages.
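As a bare-bones illustration of those two concepts (a sketch only; the start URL and the handlePage function are placeholders, and request/cheerio are the modules already used above), the crawl keeps a task queue of pending URLs and a lookup cache of URLs it has already queued:
var request = require('request');
var cheerio = require('cheerio');
var urlLib = require('url');

var queue = ['http://example.com/'];            // task queue of URLs still to fetch
var seen = { 'http://example.com/': true };     // lookup cache of URLs already queued

function crawlNext() {
    var current = queue.shift();
    if (!current) { return; }                   // queue empty, crawl finished
    request(current, function (error, response, html) {
        if (!error && response.statusCode == 200) {
            var $ = cheerio.load(html);
            // enqueue every link we have not seen yet
            $('a').each(function () {
                var href = $(this).attr('href');
                if (!href) { return; }
                var next = urlLib.resolve(current, href);
                if (!seen[next]) {
                    seen[next] = true;
                    queue.push(next);
                }
            });
            handlePage(current, $);             // placeholder: extract and save data here
        }
        crawlNext();                            // move on regardless of this page's outcome
    });
}

crawlNext();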
EDIT: after clarifying comments
Before you start reworking your scraping engine into a more scalable architecture, as a new Node.js developer you can start simply with a synchronous-style alternative to Node.js callback hell, as provided by the wait.for package created by @lucio-m-tato.
The code below worked for me with the links you provided
var request = require('request');
var cheerio = require('cheerio');
var wait = require("wait.for");

function requestWaitForWrapper(url, callback) {
    request(url, function(error, response, html) {
        if (error)
            callback(error, response);
        else if (response.statusCode == 200)
            callback(null, html);
        else
            callback(new Error("Status not 200 OK"), response);
    });
}

function readBookInfo(baseUrl, s) {
    var html = wait.for(requestWaitForWrapper, baseUrl + '&s=' + s.toString());
    var $ = cheerio.load(html, {
        xmlMode: true
    });

    return {
        s: s,
        id: $('work').attr('id'),
        total: parseInt($('records').attr('total'))
    };
}

function readWorkInfo(id) {
    var html = wait.for(requestWaitForWrapper, 'http://api.trove.nla.gov.au/work/' + id.toString() + '?key=6k6oagt6ott4ohno&reclevel=full');
    var $ = cheerio.load(html, {
        xmlMode: true
    });

    return {
        title: $('title').text(),
        contributor: $('contributor').text()
    }
}

function main() {
    var baseBookUrl = 'http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1';
    var baseInfo = readBookInfo(baseBookUrl, 0);

    for (var s = 0; s < baseInfo.total; s++) {
        var bookInfo = readBookInfo(baseBookUrl, s);
        var workInfo = readWorkInfo(bookInfo.id);

        console.log(bookInfo.id + ";" + workInfo.contributor + ";" + workInfo.title);
    }
}

wait.launchFiber(main);
You could use the additional async module to handle multiple requests and iterate through several pages. Read more about async here: https://github.com/caolan/async.
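As a rough sketch of what that could look like (assumptions: baseBookUrl and total come from the example above, and the concurrency limit of 5 is arbitrary):
var async = require('async');
var request = require('request');
var cheerio = require('cheerio');

// Fetch one result page (offset s) and hand back the work id it contains.
function fetchOffset(s, done) {
    request(baseBookUrl + '&s=' + s, function(error, response, html) {
        if (error || response.statusCode != 200) {
            return done(error || new Error("Status not 200 OK"));
        }
        var $ = cheerio.load(html, { xmlMode: true });
        done(null, $('work').attr('id'));
    });
}

// Run the requests for s = 0..total-1 with at most 5 in flight at once.
async.timesLimit(total, 5, fetchOffset, function(err, ids) {
    if (err) return console.error(err);
    console.log(ids.length + ' work ids collected');
});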
