I am trying to scrape content from a web page that is continuously changing. I have been able to use PhantomJS to achieve this, but I wanted a lighter-weight solution. The following code gets the correct value the first time it prints to the console; however, on subsequent iterations the same value is printed. Any ideas?
var Browser = require("zombie");
var assert = require("assert");

// Load the page
var browser = new Browser();
browser.visit("http://www.timeanddate.com/worldclock/usa/los-angeles", function () {
    setInterval(function () {
        console.log(browser.text('#ct'));
    }, 10000);
});
Note that the above is purely an example; I know this would be a very inefficient way to get the time in Los Angeles.
Once you call browser.visit(), the browser stores the response, but unless you call it multiple times, the response won't change. See it for yourself:
browser.visit("http://www.timeanddate.com/worldclock/usa/los-angeles", function () {
console.log(browser.html()); // will print the HTML to stdout
});
So what you probably want is to call browser.visit() more than once, maybe inside setInterval() (although there may be more robust solutions out there).
I adapted your code:
var Browser = require("zombie");
var assert = require("assert");

var browser = new Browser();

setInterval(function () {
    browser.visit("http://www.timeanddate.com/worldclock/usa/los-angeles", function () {
        console.log(browser.text('#ct'));
    });
}, 10000);
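If reusing the same browser instance ever causes stale state, a variation (just a sketch, not tested) is to create a fresh Browser for each poll:
var Browser = require("zombie");

setInterval(function () {
    // A fresh browser per poll avoids reusing any state from the previous visit.
    var browser = new Browser();
    browser.visit("http://www.timeanddate.com/worldclock/usa/los-angeles", function () {
        console.log(browser.text('#ct'));
    });
}, 10000);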
I am a total scrub with the Node http module and am having some trouble.
The ultimate goal here is to take a huge list of URLs, figure out which are valid, and then scrape those pages for certain data. So step one is figuring out if a URL is valid, and this simple exercise is baffling me.
Say we have an array allURLs:
["www.yahoo.com", "www.stackoverflow.com", "www.sdfhksdjfksjdhg.net"]
The goal is to iterate over this array, make a GET request to each URL and, if a response comes in, add the link to a list of workingURLs (for now just another array); otherwise it goes into a list of brokenURLs.
var http = require('http');

var workingURLs = [];
var brokenURLs = [];

for (var i = 0; i < allURLs.length; i++) {
    var url = allURLs[i];
    var req = http.get(url, function (res) {
        if (res) {
            workingURLs.push(?????); // How to derive the URL from the response?
        }
    });
    req.on('error', function (e) {
        brokenURLs.push(e.host);
    });
}
What I don't know is how to properly obtain the URL from the request/response object itself, or really how to structure this kind of async code, because again, I am a Node.js scrub :(
For most websites using res.headers.location works, but there are times when the headers do not have this property, and that will cause problems for me later on. I've also tried console-logging the response object itself, and that was a messy and fruitless endeavor.
I have tried pushing the url variable to workingURLs, but by the time any response comes back to trigger the push, the for loop is already over and url is forever pointing to the final element of the allURLs array.
Thanks to anyone who can help
You need to close over the url value to have access to it and protect it from changes on the next loop iteration.
For example:
(function (url) {
    // use url here
})(allURLs[i]);
The simplest solution is to use forEach instead of for:
allURLs.forEach(function (url) {
    // ....
});
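Applied to your original loop it could look roughly like this (a sketch, not tested; it assumes the allURLs entries are full URLs such as "http://www.yahoo.com/"):
var http = require('http');

var workingURLs = [];
var brokenURLs = [];

allURLs.forEach(function (url) {
    // url is scoped to this callback, so the async handlers below still
    // see the right value when they eventually fire.
    http.get(url, function (res) {
        workingURLs.push(url);
        res.resume(); // discard the body so the socket is released
    }).on('error', function (e) {
        brokenURLs.push(url);
    });
});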
A promisified solution also lets you know when all the work is done:
var http = require('http');

var allURLs = [
    "http://www.yahoo.com/",
    "http://www.stackoverflow.com/",
    "http://www.sdfhksdjfksjdhg.net/"
];

var workingURLs = [];
var brokenURLs = [];

var promises = allURLs.map(url => validateUrl(url)
    .then(res => (res ? workingURLs : brokenURLs).push(url)));

Promise.all(promises).then(() => {
    console.log(workingURLs, brokenURLs);
});

// ----

function validateUrl(url) {
    return new Promise((ok, fail) => {
        http.get(url, res => ok(res.statusCode == 200))
            .on('error', e => ok(false));
    });
}

// Keep the Node.js process alive; not needed if a server is listening.
var t = setTimeout(() => { console.log('Time is over'); }, 1000).ref();
You can use something like this (Not tested):
const arr = ["", "/a", "", ""];

Promise.all(arr.map(url => fetch(url)))
    .then(responses => responses.filter(res => res.ok).map(res => res.url))
    .then(workingUrls => {
        console.log(workingUrls);
        console.log(arr.filter(url => workingUrls.indexOf(url) == -1));
    });
EDITED
Working fiddle (note that you can't make requests to another site in the browser because of cross-domain restrictions).
UPDATED with #vp_arth suggestions
const arr = ["/", "/a", "/", "/"];
let working=[], notWorking=[],
find = url=> fetch(url)
.then(res=> res.ok ?
working.push(res.url) && res : notWorking.push(res.url) && res);
Promise.all(arr.map(find))
.then(responses=>{
console.log('woking', working, 'notWorking', notWorking);
/* Do whatever with the responses if needed */
});
Fiddle
Hello, I am new to Casper and Node. I am trying to run code that scrapes data from a site, but the waitForSelector function is not working correctly. My code is:
casper.waitForSelector('.searchAutoSuggstn', function() {
    this.echo('Search auto suggestion.'); // this line does print to my console
    var data = this.evaluate(function() {
        var suggestions = [];
        this.echo('Search auto suggestion data.'); // but this line does not print to my console
        var element = $('.searchAutoSuggstn .suggestionsList_menu').find('.topProdhead_left').prevAll().filter(function() {
            this.echo('omnitrack');
            return $(this).data("omnitrack");
        });
Can anybody tell me what the main problem is?
You can't call casper methods such as casper.echo() in casper.evaluate() since evaluate executes code in the context of the browser.
You could use console.log in the browser to output to the JavaScript console, which you can then catch with this hook:
casper.on("remote.messsage", function(msg){
// Do something
});
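For example, a rough sketch of that approach (untested, reusing the selectors from the question):
casper.on("remote.message", function (msg) {
    // Anything logged with console.log inside evaluate() ends up here.
    casper.echo("From the page: " + msg);
});

casper.waitForSelector('.searchAutoSuggstn', function () {
    this.echo('Search auto suggestion.');
    var count = this.evaluate(function () {
        console.log('Search auto suggestion data.'); // runs in the page, relayed via remote.message
        return $('.searchAutoSuggstn .suggestionsList_menu .topProdhead_left').length;
    });
    this.echo('Found ' + count + ' elements.');
});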
OK, so I am using a method to make a request and pull some tables from another URL:
Meteor.methods({
    gimmetitle: function () {
        var url = 'http://wiki.warthunder.com/index.php?title=B-17G_Flying_Fortress';
        request(url, function (err, response, body) {
            $ = cheerio.load(body);
            var text = $('.flight-parameters td').text();
            console.log(text);
            return text;
        });
    }
});
When called, the td's in the table successfully print to the server console: http://prntscr.com/721pjh
But when that text is returned from the method to this client code, undefined is printed to the console:
Template.title.events({
    'click #thebutton': function () {
        Meteor.call('gimmetitle', function (error, result) {
            Session.set('gogle', result);
        });
        var avar = Session.get('gogle');
        console.log(avar);
    }
});
Ideas?
You need to understand two different things here:
On the client side, making calls to the server is always asynchronous, because we have to deal with network latency. That's why we use callbacks to fetch the result of Meteor methods: this code is executed some time in the future, not right away.
This is why Session.set('gogle', result); is actually executed AFTER var avar = Session.get('gogle'); even though it appears before in your event handler code flow.
Contrary to template helpers, event handlers are NOT reactive, so it means that when you set the Session variable to the result of the method, the event handler code is not automatically reexecuted with the new value of Session.get('gogle').
You'll need to either do something with the result right in the Meteor method callback, or use a reactive computation (template helpers or Tracker.autorun) depending on Session.get('gogle') to rerun whenever the reactive data source is modified, and use the new value fetched from the server and assigned to the Session variable.
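For instance, a minimal sketch (the scrapedText helper name is made up for illustration): set the Session variable in the method callback and read it from a reactive helper, e.g. {{scrapedText}} in the title template:
Template.title.events({
    'click #thebutton': function () {
        Meteor.call('gimmetitle', function (error, result) {
            // Runs later, once the server has answered.
            Session.set('gogle', result);
        });
    }
});

// Helpers are reactive: this one reruns whenever Session.get('gogle') changes.
Template.title.helpers({
    scrapedText: function () {
        return Session.get('gogle');
    }
});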
Quick update: I was able to fix this with just one line of code.
Instead of request(url, function(err, response, body), I used the froatsnook:request package with var result = request.getSync(url, {encoding: null}); and then just replaced $ = cheerio.load(body); with $ = cheerio.load(result.body);.
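Put together, the method would then look roughly like this (a sketch based on the description above, assuming froatsnook:request and cheerio are available as request and cheerio):
Meteor.methods({
    gimmetitle: function () {
        var url = 'http://wiki.warthunder.com/index.php?title=B-17G_Flying_Fortress';
        // Synchronous on the server, so the value can simply be returned to the client.
        var result = request.getSync(url, {encoding: null});
        var $ = cheerio.load(result.body);
        return $('.flight-parameters td').text();
    }
});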
I want to write a test to update a blog post (or whatever):
* Insert a blog post in a database
* Get the id the blog post got in MongoDb
* POST an updated version to my endpoint
* After the request has finished: check in the database that the update has been done
Here's the test, using koa:
var co = require('co');
var db = require('../lib/db.js');

describe('a test suite', function () {
    it('updates an existing text', function (done) {
        co(function * () {
            var insertedPost = yield db.postCollection.insert({ title: "Title", content: "My awesome content" });
            var id = insertedPost._id;
            var url = "/post/" + id;

            var updatedPost = { content: 'Awesomer content' };

            request
                .post(url)
                .send(updatedPost)
                .expect(302)
                .expect('location', url)
                .end(function () {
                    co(function *() {
                        var p = yield db.postCollection.findById(id);
                        p.content.should.equal(updatedPost.content);
                        console.log("CHECKED DB");
                    })(done());
                });
        });
    });
});
I realize that there are a lot of moving parts in there, but I've tested all the interactions separately. Here's the db file I've included (which I know works fine since I use it in production):
var monk = require('monk');
var wrap = require('co-monk');

function getCollection(mongoUrl, collectionName) {
    var db = monk(mongoUrl);
    return wrap(db.get(collectionName));
}

module.exports.postCollection = getCollection([SECRET MONGO CONNECTION], 'posts');
The production code works as intended.
This test passes, but it seems to me like the co function in the .end() clause is never run... yet the done() call gets made. No "CHECKED DB" is being printed, at least.
I've tried with "done()" and with "done" without parentheses. Sometimes that works and sometimes not.
I've tried to move the check of the database outside the request... but that just hangs, since supertest wants us to call done() when we are finished.
All of this leaves me confused and scared :) - what am I doing wrong here?
Realising that the question was very long-winded and specific, I feared that I would never get a proper answer, due to the badly asked question.
But the answer given and the comments made me look again, and I found it. I wrote a long blog post about it, but I'll give away the end of it here as a summary. If it doesn't make sense, there's more of the same :) in the blog post.
Here is the TL;DR:
I wanted to check the state of the database after doing a request. This can be done using the .end() function of supertest.
Since I used co-monk I wanted to be able to do that using yield and generators. This means that I need to wrap my generator function with co.
co, since version 4.0.0, returns a promise. This is perfect for users of mocha, since it allows us to use the .then() function and pass the done variable to both the success and failure functions of .then(fn success, fn failure(err)).
The test in its entirety is displayed below. Running it returns an error due to the failing assertion, as I want:
var co = require("co");
var should = require("should");

var helpers = require('./testHelpers.js');
var users = helpers.users;
var request = helpers.request;

describe('POST to /user', function () {
    var test_user = {};

    beforeEach(function (done) {
        test_user = helpers.test_user;
        helpers.removeAll(done);
    });

    afterEach(function (done) {
        helpers.removeAll(done);
    });

    it('creates a new user for complete posted data', function (done) {
        // Post
        request
            .post('/user')
            .send(test_user)
            .expect('location', /^\/user\/[0-9a-fA-F]{24}$/) // Mongo ObjectId, e.g. /user/234234523562512512
            .expect(201)
            .end(function () {
                co(function *() {
                    var userFromDb = yield users.findOne({ name: test_user.name });
                    userFromDb.name.should.equal("This is not the name you are looking for");
                }).then(done, done);
            });
    });
});
This happens because
var p = yield db.postCollection.findById(id);
is the last line that will be executed in your generator function.
You can test whether I am right by adding a console.log('before first yield').
yield takes the place of return in generator functions, but when the generator is resumed, execution continues from that yield to the next one. In other words, a generator function is executed from yield to yield (the best short way to explain it, I think).
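A tiny standalone illustration of that execution model (my sketch, not from the original code): each call to next() runs the body up to the following yield and then pauses:
function* demo() {
    console.log('runs on the first next()');
    yield 1;
    console.log('runs on the second next()');
    yield 2;
    console.log('runs after the last yield, on the third next()');
}

var it = demo();
it.next(); // logs the first message, pauses at the first yield
it.next(); // logs the second message, pauses at the second yield
// Without a third next(), the final console.log never runs.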
Your solution:
simply remove the yield before the database find:
var p = db.postCollection.findById(id);
Complete Node.js noob here, so don't judge me...
I have a simple requirement. Crawl a web site, find all the product pages, and save some data from the product pages.
Easier said than done.
Looking at Node.js samples, I can't find anything similar.
There's a request-based scraper:
var request = require('request');
var jsdom = require('jsdom');

request({uri: 'http://www.google.com'}, function (error, response, body) {
    if (!error && response.statusCode == 200) {
        var window = jsdom.jsdom(body).createWindow();
        jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
            // jQuery is now loaded on the jsdom window created from 'body'
            jquery('.someClass').each(function () { /* Your custom logic */ });
        });
    }
});
But I can't figure out how to make it call itself once it scrapes the root page, or how to populate an array of URLs that it needs to scrape.
Then there's the http agent way:
var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);

agent.addListener('next', function (err, agent) {
    var window = jsdom.jsdom(agent.body).createWindow();
    jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
        // jQuery is now loaded on the jsdom window created from 'agent.body'
        jquery('.someClass').each(function () { /* Your custom logic */ });
        agent.next();
    });
});

agent.addListener('stop', function (agent) {
    sys.puts('the agent has stopped');
});

agent.start();
That takes an array of locations, but then again, once you get it started with an array, you can't add more locations to it to go through all the product pages.
And I can't even get Apricot working; for some reason I'm getting an error.
So, how do I modify any of the above examples (or anything not listed above) to scrape a site, find all the product pages, find some data on them (the jquery.someClass example should do the trick), and save that to a db?
Thanks!
Personally, I use Node IO to scrape some websites. https://github.com/chriso/node.io
More details about scraping can be found in the wiki!
I've had pretty good success crawling and scraping with CasperJS. It's a pretty nice library built on top of PhantomJS. I like it because it's fairly succinct: callbacks can be executed as foo.then(), which is super simple to understand, and I can even use jQuery since PhantomJS is built on WebKit. For example, the following would instantiate an instance of Casper and push all links on an archive page to an array called 'links'.
var casper = require("casper").create();

var numberOfLinks = 0;
var currentLink = 0;
var links = [];
var buildPage, capture, selectLink, grabContent, writeContent;

casper.start("http://www.yoursitehere.com/page_to/scrape/", function () {
    numberOfLinks = this.evaluate(function () {
        return __utils__.findAll('.nav-selector a').length;
    });
    this.echo(numberOfLinks + " items found");
    // because jQuery makes it easier
    casper.page.injectJs('/PATH/TO/jquery.js');
});

// Capture links
capture = function () {
    links = this.evaluate(function () {
        var link = [];
        jQuery('.nav-selector a').each(function () {
            link.push($(this).attr('href'));
        });
        return link;
    });
    this.then(selectLink);
};
You can then use node fs (or whatever else you want, really) to push your data into XML, CSV, or whatever you want. The example for scraping BBC photos was exceptionally helpful when I built my scraper.
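For instance, a minimal sketch of the writeContent step declared above (my addition, not from the gist; note that require('fs') inside CasperJS resolves to PhantomJS's fs module, whose write call takes a path, the content, and a mode):
var fs = require('fs'); // under CasperJS this is PhantomJS's fs module

writeContent = function () {
    // One link per line; adjust the format (CSV columns, XML, JSON, ...) as needed.
    fs.write('links.txt', links.join('\n'), 'w');
    casper.echo('Wrote ' + links.length + ' links to links.txt');
};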
This is a view from 10,000 feet of what casper can do. It has a very potent and broad API. I dig it, in case you couldn't tell :).
My full scraping example is here: https://gist.github.com/imjared/5201405.