Node.js Web Scraping with Jsdom - node.js

I would like to scrape the http://www.euromillones.com.es/ website to get the last winning 5 numbers and two stars. They can be seen in the left column of the website. I've been reading tutorials, but I haven't been able to achieve this.
This is the code I wrote so far:
app.get('/winnernumbers', function(req, res){
    // Tell request that we want to fetch the Euromillones site, send the results to a callback function
    request({uri: 'http://www.euromillones.com.es/'}, function(err, response, body){
        var self = this;
        self.items = new Array(); // I feel like I want to save my results in an array
        // Just a basic error check
        if(err && response.statusCode !== 200){console.log('Request error.');}
        // Send the body param as the HTML code we will parse in jsdom
        // also tell jsdom to attach jQuery in the scripts, loaded from jquery.com
        jsdom.env({
            html: body,
            scripts: ['http://code.jquery.com/jquery-1.6.min.js']
        }, function(err, window){
            // Use jQuery just as in a regular HTML page
            var $ = window.jQuery;
            res.send($('title').text());
        });
    });
});
I am getting the following error:
Must pass a "created", "loaded", "done" option or a callback to jsdom.env.

It looks to me that you've just used a combination of arguments that jsdom does not know how to handle. The documentation shows this signature:
jsdom.env(string, [scripts], [config], callback);
The two middle arguments are optional but you'll note that all possible combinations here start with a string and end with a callback. The documentation mentions one more way to call jsdom.env, and that's by passing a single config argument. What you are doing amounts to:
jsdom.env(config, callback);
which does not correspond to any of the documented methods. I would suggest changing your code to pass a single config argument. You can move your current callback to the done field of that config object. Something like this:
jsdom.env({
    html: body,
    scripts: ['http://code.jquery.com/jquery-1.6.min.js'],
    done: function (err, window) {
        // Use jQuery just as in a regular HTML page
        var $ = window.jQuery;
        res.send($('title').text());
    }
});
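For reference, a minimal sketch of the whole route with that change applied might look like this (the error check is tightened to use || so a failed request is caught, and the 'title' selector is just a placeholder for whatever selector matches the winning numbers in the left column):
app.get('/winnernumbers', function (req, res) {
    request({uri: 'http://www.euromillones.com.es/'}, function (err, response, body) {
        if (err || response.statusCode !== 200) {
            console.log('Request error.');
            return res.send('Request error.');
        }
        jsdom.env({
            html: body,
            scripts: ['http://code.jquery.com/jquery-1.6.min.js'],
            done: function (err, window) {
                var $ = window.jQuery;
                // Replace 'title' with the selector that matches the numbers
                // and stars in the left column of the page.
                res.send($('title').text());
            }
        });
    });
});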

Related

iterative render with node and express

New to node and async and still struggling with concepts.
Trying to use express/handlebars render with a callback to iteratively build an HTML body with content from an array. The end goal is to send a response containing a number of emails, each one individually rendered using a view (.hbs).
I got this far but realised it was never going to work: res.render can't pass my html variable back from the callback, and res.send runs before the renders have completed.
function buildRes (req, res, email) {
    var html = '';
    Object.keys(email).forEach(function (i) {
        res.render('emailPanel', {subject: email[i].subject, body: email[i].body},
            function (err, renOut) {
                if (err) throw err;
                html = html + renOut;
            }
        );
    });
    res.send(html);
}
Any suggestions on how I should be approaching this problem?
Started out trying to use the handlebars #each helper to do the iteration, but all of the examples show a simple list, whereas in my case there are multiple parameters per array element to be passed to the render.
I'm still not sure what you're trying to accomplish with this, but one thing is for sure: it's better to do all the looping inside your view by passing the entire (filtered) array to res.render. Also note that you can only respond once per request.
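A rough sketch of that approach (the view name emailList.hbs and the field names are assumptions based on the question):
// Route handler: render once, passing the whole array to the view.
function buildRes(req, res, email) {
    res.render('emailList', { emails: email });
}

// views/emailList.hbs -- loop over the array with the built-in #each helper:
// {{#each emails}}
//   <div class="email-panel">
//     <h2>{{subject}}</h2>
//     <p>{{body}}</p>
//   </div>
// {{/each}}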

how to publish a page using node.js

I have just begun to learn node.js. Over the last two days, I've been working on a project that accepts user input and publishes an ICS file. I have all of that working. Now consider when I have to show this data. I use router.get to see if I am at the /cal page and...
router.get('/cal', function(req, res, next) {
    var db = req.db;
    var ical = new icalendar.iCalendar();
    db.find({
        evauthor: 'mykey'
    }, function(err, docs) {
        docs.forEach(function(obj) {
            var event2 = ical.addComponent('VEVENT');
            event2.setSummary(obj.evics.evtitle);
            event2.setDate(new Date(obj.evics.evdatestart), new Date(obj.evics.evdateend));
            event2.setLocation(obj.evics.evlocation);
            //console.log(ical.toString());
        });
    });
    res.send(ical.toString());
    // res.render('index', {
    //     title: 'Cal View'
    // })
});
So when /cal is requested, it loops through my db and creates an ICS calendar, ical. If I do console.log(ical.toString()) within the loop, it gives me a properly formatted calendar following the protocol.
However, I'd like to END the response with this. At the end I do a res.send just to see what gets published on the page. This is what gets published:
BEGIN:VCALENDAR VERSION:2.0
PRODID:calendar//EN
END:VCALENDAR
Now the reason is pretty obvious. It's the nature of node.js. The response gets sent to the browser before the callback function finishes adding each individual VEVENT to the calendar object.
I have two related questions:
1) What's the proper way to "wait" until the callback is done?
2) How do I use res to send out a dynamic .ics link with ical.toString() as the content? Do I need to create a new view for this?
edit: I guess for number 2 I'd have to set the HTTP headers like so
//set correct content-type-header
header('Content-type: text/calendar; charset=utf-8');
header('Content-Disposition: inline; filename=calendar.ics');
but how do I do this when using views?
Simply send the response once you have the necessary data! You are not required to end or send the response directly in your route; you can do it in a nested callback as well:
router.get('/cal', function(req, res, next) {
    var db = req.db;
    var ical = new icalendar.iCalendar();
    db.find({
        evauthor: 'mykey'
    }, function(err, docs) {
        docs.forEach(function(obj) {
            var event2 = ical.addComponent('VEVENT');
            event2.setSummary(obj.evics.evtitle);
            event2.setDate(new Date(obj.evics.evdatestart), new Date(obj.evics.evdateend));
            event2.setLocation(obj.evics.evlocation);
        });
        res.type('ics');
        res.send(ical.toString());
    });
});
I also included sending the proper Content-Type by using res.type.
Also: don't forget to add proper error handling. You can, for example, use res.sendStatus(500) if an error occurred while retrieving the documents.
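A minimal sketch of that error handling inside the db.find callback (assuming a version of Express where res.sendStatus is available):
db.find({ evauthor: 'mykey' }, function(err, docs) {
    if (err) {
        // Something went wrong while retrieving the documents.
        return res.sendStatus(500);
    }
    // ... build the calendar and send it as shown above ...
});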

check on server side if youtube video exists

How to check if youtube video exists on node.js app server side:
var youtubeId = "adase268_";
// pseudo code
youtubeVideoExist = function (youtubeId){
    return true; // if youtube video exists
}
You don't need to use the YouTube API per se, you can look for the thumbnail image:
Valid video = 200 - OK:
http://img.youtube.com/vi/gC4j-V585Ug/0.jpg
Invalid video = 404 - Not found:
http://img.youtube.com/vi/gC4j-V58xxx/0.jpg
I thought I could make this work from the browser since you can load images from a third-party site without security problems. But testing it, it's failing to report the 404 as an error, probably because the content body is still a valid image. Since you're using node, you should be able to look at the HTTP response code directly.
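As a minimal sketch of that idea, using the request module and a callback to report the result (the function name and callback shape are assumptions, not part of the original question):
var request = require('request');

// Calls back with true if the thumbnail exists (HTTP 200), false otherwise.
function youtubeVideoExists(youtubeId, callback) {
    var thumbnailUrl = 'http://img.youtube.com/vi/' + youtubeId + '/0.jpg';
    request.head(thumbnailUrl, function (err, response) {
        if (err) { return callback(false); }
        callback(response.statusCode === 200);
    });
}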
I can't think of an approach that doesn't involve making a separate HTTP request to the video link to see if it exists or not, unless you know beforehand of a set of video IDs that are inactive, dead, or wrong.
Here's an example of something that might work for you. I can't readily tell if you're using this as a standalone script or as part of a web server. The example below assumes the latter, assuming you call a web server on /video?id=123videoId and have it respond or do something depending on whether or not the video with that ID exists. It uses the request library, which you can install with npm install request:
var request = require('request');
// Your route here. Example of what the route may look like if called on /video?id=123videoId
app.get('/video', function(req, res){
    var videoId = 'adase268_'; // Could change to something like req.query.id
    request.get('https://www.youtube.com/watch?v=' + videoId, function(error, response, body){
        if(response.statusCode === 404){
            // Video doesn't exist. Do what you need to do here.
        }
        else{
            // Video exists.
            // Can handle other HTTP response codes here if you like.
        }
    });
});
// You could refactor the above to take out the 'request.get()', wrap it in a function
// that takes a callback and re-use in multiple routes, depending on your problem.
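A sketch of that refactor (the helper name checkVideoExists is hypothetical):
var request = require('request');

// Hypothetical helper: wraps request.get() and reports whether the video
// exists through a callback, so it can be reused across routes.
function checkVideoExists(videoId, callback) {
    request.get('https://www.youtube.com/watch?v=' + videoId, function (error, response) {
        if (error) { return callback(error); }
        callback(null, response.statusCode !== 404);
    });
}

app.get('/video', function (req, res) {
    checkVideoExists(req.query.id, function (err, exists) {
        if (err) { return res.send('Error checking video.'); }
        res.send(exists ? 'Video exists.' : 'Video does not exist.');
    });
});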
#rodrigomartell is on the right track, in that your check function will need to make an HTTP call; however, just checking the youtube.com URL won't work in most cases. You'll get back a 404 if the videoID is a malformed ID (i.e. less than 11 characters or using characters not valid in their scheme), but if it's a properly formed videoID that just happens to not correspond to a video, you'll still get back a 200. It would be better to use an API request, like this (note that it might be easier to use the request-json library instead of just the request library):
var request = require('request-json');
var client = request.newClient('https://www.googleapis.com/youtube/v3/');

// Note: the result has to be delivered through a callback, since the API
// call is asynchronous and a plain return value would be discarded.
youtubeVideoExist = function (youtubeId, callback){
    var apikey = 'YOUR_API_KEY'; // register for an API key at the Google Developers Console ... https://console.developers.google.com/
    client.get('videos/?part=id&id=' + youtubeId + '&key=' + apikey, function(err, res, body) {
        if (body.items.length) {
            callback(true); // the youtube video exists
        }
        else {
            callback(false);
        }
    });
};
Using the youtube-feeds module. Works fast (~200 ms) and no API key is needed.
var youtube = require("youtube-feeds");

var existsFunc = function(youtubeId, callback) {
    youtube.video(youtubeId, function(err, result) {
        var exists = result.id === youtubeId;
        console.log("youtubeId");
        console.log(youtubeId);
        console.log("exists");
        console.log(exists);
        callback(exists);
    });
};

var notExistentYoutubeId = "y0srjasdkfjcKC4eY";
existsFunc(notExistentYoutubeId, console.log);

var existentYoutubeId = "y0srjcKC4eY";
existsFunc(existentYoutubeId, console.log);
output:
❯ node /pathToFileWithCodeAbove/FileWithCodeAbove.js
youtubeId
y0srjcKC4eY
exists
true
true
youtubeId
y0srjasdkfjcKC4eY
exists
false
false
All you need is to look for the thumbnail image. In Node.js it would be something like:
var http = require('http');

function isValidYoutubeID(youtubeID) {
    var options = {
        method: 'HEAD',
        host: 'img.youtube.com',
        path: '/vi/' + youtubeID + '/0.jpg'
    };
    var req = http.request(options, function(res) {
        if (res.statusCode == 200){
            console.log("Valid Youtube ID");
        } else {
            console.log("Invalid Youtube ID");
        }
    });
    req.end();
}
An API key is not needed. It is quite fast because only the headers are checked for status code 200/404; the image itself is not downloaded.

Node.js: Using multiple moustache templates

I'm trying to break a moustache template up into various components so I can reuse them, and get the assembled text returned through node.js. I cannot find anyone that has done this.
I can return moustache pages simply with:
function index(request, response, next) {
    var stream = mu.compileAndRender('index.mu',
        {name: "Me"}
    );
    util.pump(stream, response);
}
I just cannot figure out how to render a template and use it in another template. I've tried rendering it separately like this:
function index(request, response, next) {
    var headerStream = mu.compileAndRender('header.mu', {title: 'Home page'});
    var headerText = '';
    headerStream.on('data', function(data) {
        headerText = headerText + data.toString();
    });
    var stream = mu.compileAndRender('index.mu',
        {
            heading: 'Home Page',
            content: 'hello this is the home page',
            header: headerText
        });
    util.pump(stream, response);
}
But the problem is that the header is not rendered before the page, and even if I get that to happen, the header is seen as display text rather than HTML.
Any help appreciated.
You need to put the lines var stream = ... and util.pump(... inside headerStream.on('end', function () { /* here */ });, so that they are run after headerStream has sent all the data to your data listener.
To include raw HTML you have to use the triple braces: {{{raw}}} in your template as http://mustache.github.com/mustache.5.html states.
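Put together, a minimal sketch of that change (still using mu and util.pump as in the question) might look like:
function index(request, response, next) {
    var headerStream = mu.compileAndRender('header.mu', {title: 'Home page'});
    var headerText = '';
    headerStream.on('data', function (data) {
        headerText = headerText + data.toString();
    });
    headerStream.on('end', function () {
        // Only render the page once the header has been fully rendered.
        var stream = mu.compileAndRender('index.mu', {
            heading: 'Home Page',
            content: 'hello this is the home page',
            header: headerText // use {{{header}}} in index.mu so it is not escaped
        });
        util.pump(stream, response);
    });
}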
The moustache guys came back with the answer. I can do it by using partials like this:
{{> index.mu}}
Inside the moustache template rather than trying to do it in node.js.
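For example, a sketch of that approach (assuming mu resolves the partial by file name, as the answer suggests) could be:
// index.mu -- pull in the shared header as a partial:
// {{> header.mu}}
// <h1>{{heading}}</h1>
// <p>{{content}}</p>

// Node side: render only the top-level template; the partial is handled by mu.
function index(request, response, next) {
    var stream = mu.compileAndRender('index.mu', {
        heading: 'Home Page',
        content: 'hello this is the home page'
    });
    util.pump(stream, response);
}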

jsdom form submission?

I'm trying to use the Node.js packages request and jsdom to scrape web pages, and I want to know how I can submit forms and get their responses. I'm not sure if this is possible with jsdom or another module, but I do know that request supports cookies.
The following code demonstrates how I'm using jsdom (along with request and jQuery) to retrieve and parse a web page (in this case, the Wikipedia home page). (Note that this code is adapted from the jquery-request.js code from this tutorial http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs)
var request = require('request'),
    jsdom = require('jsdom'),
    url = 'http://www.wikipedia.org';

request({ uri: url }, function (error, response, body) {
    if (error && response.statusCode !== 200) {
        console.log('Error when contacting ' + url);
    }
    jsdom.env({
        html: body,
        scripts: [
            'http://code.jquery.com/jquery-1.5.min.js'
        ]
    }, function (err, window) {
        var $ = window.jQuery,
            // jQuery is now loaded on the jsdom window created from 'body'
            $searchform = $('#searchform'); // search form jQuery object
        $('#searchInput').val('Wood');
        console.log('form HTML is ' + $searchform.html(),
            'search value is ' + $('#searchInput').val());
        // how I'd like to submit the search form
        $('#searchform .searchButton').click();
    });
});
The above code prints the HTML from Wikipedia's search form, then "Wood", the value I set the searchInput field to contain. Of course, here the click() method doesn't really do anything, because jQuery isn't operating in a browser; I don't even know if jsdom supports any kind of event handling.
Is there any module that can help me to interact with web pages in this way, or in a similar non-jQuery way? Can this be done in jsdom?
Thanks in advance!
If you don't want to handle the POST request yourself as in the other answer, you can use an alternative to jsdom that supports more of what a browser does:
http://www.phantomjs.org/
I'm not familiar with a nodejs library that will let you get a fully interactive client-side view of a web-page, but you can get the results of a form submission without too much worry.
HTML forms are essentially just a way of sending HTTP requests to a specific URL (which can be found as the action attribute of the form tag). With access to the DOM, you can just pull out these values and create your own request for the specified URL.
Something like this as the callback from requesting the Wikipedia home page will get you the result of doing a search for "keyboard cat" in English:
var $ = window.jQuery;

var search_term = "keyboard cat";
var search_term_safe = encodeURIComponent(search_term).replace("%20", "+");
var lang = "en";
var lang_safe = encodeURIComponent(lang).replace("%20", "+");

var search_submit_url = $("#searchform").attr("action");
var search_input_name = $("#searchInput").attr("name");
var search_language_name = $("#language").attr("name");

var search_string = search_input_name + "=" + search_term_safe + "&" + search_language_name + "=" + lang_safe;

// Note the wikipedia specific hack by prepending "http:".
var full_search_uri = "http:" + search_submit_url + "?" + search_string;

request({ uri: full_search_uri }, function(error, response) {
    if (error && response.statusCode != 200) {
        console.log("Got an error from the search page: " + error);
    } else {
        // Do some stuff with the response page here.
    }
});
Basically the important stuff is:
"Submitting a search" really just means sending either a HTTP GET or POST request to the URL specified at the action attribute of the form tag.
Create the string to use for form submission using the name attributes of each of the form's input tags, combined with the value that they are actually submitting, in this format: name1=value1&name2=value2
For GET requests, just append that string to the URL as a query string (URL?query-string)
For POST requests, post that string as the body of the request.
Note that the string used for form submission must be escaped and have spaces represented as +.
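For the POST case, a sketch using the request module's form option might look like this (it reuses the field names pulled from the Wikipedia form above; whether that particular form accepts POST is an assumption):
var post_data = {};
post_data[search_input_name] = search_term;   // request URL-encodes form values itself
post_data[search_language_name] = lang;

request.post({
    uri: "http:" + search_submit_url,
    form: post_data   // sent as an application/x-www-form-urlencoded body
}, function (error, response, body) {
    if (error || response.statusCode !== 200) {
        console.log("Got an error from the search page: " + error);
    } else {
        // Do some stuff with the response page here.
    }
});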
