I'm trying to use the Node.js packages request and jsdom to scrape web pages, and I want to know how I can submit forms and get their responses. I'm not sure if this is possible with jsdom or another module, but I do know that request supports cookies.
The following code demonstrates how I'm using jsdom (along with request and jQuery) to retrieve and parse a web page (in this case, the Wikipedia home page). (Note that this code is adapted from the jquery-request.js example in this tutorial: http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs)
var request = require('request'),
    jsdom = require('jsdom'),
    url = 'http://www.wikipedia.org';

request({ uri: url }, function (error, response, body) {
    if (error || response.statusCode !== 200) {
        console.log('Error when contacting ' + url);
        return;
    }
    jsdom.env({
        html: body,
        scripts: [
            'http://code.jquery.com/jquery-1.5.min.js'
        ]
    }, function (err, window) {
        // jQuery is now loaded on the jsdom window created from the response body
        var $ = window.jQuery,
            $searchform = $('#searchform'); // search form jQuery object
        $('#searchInput').val('Wood');
        console.log('form HTML is ' + $searchform.html(),
                    'search value is ' + $('#searchInput').val());
        // how I'd like to submit the search form
        $('#searchform .searchButton').click();
    });
});
The above code prints the HTML from Wikipedia's search form, then "Wood", the value I set the searchInput field to contain. Of course, here the click() method doesn't really do anything, because jQuery isn't operating in a browser; I don't even know if jsdom supports any kind of event handling.
Is there any module that can help me to interact with web pages in this way, or in a similar non-jQuery way? Can this be done in jsdom?
Thanks in advance!
If you don't want to handle the POST request yourself, as in the other answer, you can use an alternative to jsdom that supports more browser behavior:
http://www.phantomjs.org/
I'm not familiar with a Node.js library that will give you a fully interactive client-side view of a web page, but you can get the results of a form submission without too much worry.
HTML forms are essentially just a way of sending HTTP requests to a specific URL (which can be found as the action attribute of the form tag). With access to the DOM, you can just pull out these values and create your own request for the specified URL.
Something like this as the callback from requesting the Wikipedia home page will get you the result of doing a search for "keyboard cat" in English:
var $ = window.jQuery;

var search_term = "keyboard cat";
var search_term_safe = encodeURIComponent(search_term).replace(/%20/g, "+");
var lang = "en";
var lang_safe = encodeURIComponent(lang).replace(/%20/g, "+");

var search_submit_url = $("#searchform").attr("action");
var search_input_name = $("#searchInput").attr("name");
var search_language_name = $("#language").attr("name");

var search_string = search_input_name + "=" + search_term_safe + "&" + search_language_name + "=" + lang_safe;

// Note the Wikipedia-specific hack of prepending "http:".
var full_search_uri = "http:" + search_submit_url + "?" + search_string;

request({ uri: full_search_uri }, function (error, response) {
    if (error || response.statusCode !== 200) {
        console.log("Got an error from the search page: " + error);
    } else {
        // Do some stuff with the response page here.
    }
});
Basically the important stuff is:
- "Submitting a search" really just means sending an HTTP GET or POST request to the URL specified in the action attribute of the form tag.
- Create the string to use for form submission from the name attribute of each of the form's input tags, combined with the value it is actually submitting, in this format: name1=value1&name2=value2
- For GET requests, just append that string to the URL as a query string (URL?query-string).
- For POST requests, send that string as the body of the request.
- Note that the string used for form submission must be escaped and have spaces represented as +.
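The steps above can be sketched as a small helper that builds the name1=value1&name2=value2 string. This is a minimal sketch in plain JavaScript; the field names and values here are just examples, not necessarily Wikipedia's actual ones:

```javascript
// Build an application/x-www-form-urlencoded body from a map of field
// names to values, escaping each part and representing spaces as "+".
function encodeForm(fields) {
  return Object.keys(fields).map(function (name) {
    return encodeURIComponent(name).replace(/%20/g, "+") + "=" +
           encodeURIComponent(fields[name]).replace(/%20/g, "+");
  }).join("&");
}

var body = encodeForm({ search: "keyboard cat", language: "en" });
console.log(body); // "search=keyboard+cat&language=en"
```

For a GET form you append this to the action URL after a ?; for a POST form you send it as the request body (the request module can also build it for you via its form option).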
I'm trying to trigger a click from the server side.
I'm using Node.js and I'm not able to use jQuery functions.
I would like to click on the .next element.
This is what I would like to do:
while (nbrPage > 0)
{
//my scraping code
nbrPage--;
$('.next').click();
}
Note that the HTML code to scrape looks like this:
<span class="next">
<a id="nextPage-159c6fa8635" class="page" href="/blablabla"></a>
</span>
Does anyone know how to use jQuery methods in Node.js code, or how to trigger a click in Node.js?
EDIT: I'm scraping a website and I want to loop over each pagination page and scrape my data from each one. To do this I need to go to the next page by clicking on the HTML element shown above. In other words, I would like to use jQuery functions like $('.next').click() in my Node.js code (using request and cheerio).
Note that I don't want to handle the click event; I want to trigger the click.
Thanks for your help
Cheerio is a pretty useful tool which allows you to utilize jQuery within Node.js. You can find more information over at - https://github.com/cheeriojs/cheerio
Request is designed to be the simplest way possible to make http
calls. It supports HTTPS and follows redirects by default.
Check out their documentation - https://github.com/request/request
For server-side scraping, you need to create a function that finds the a element whose id starts with "nextPage-". Then, if one is found, you need to get the value of its href attribute.
From there you would pass that value back to your "request" script, which I assume you already have, and continue your scraping until a "nextPage-" link can no longer be located.
That repetitive sequence of a function calling itself is called "recursion".
Now for what that might look like in code -
// Load Dependencies
const CHEERIO = require("cheerio");
const REQUEST = require("request");
/**
 * Scrapes HTML to find the next page URL
 *
 * @function getNextPageUrl
 *
 * @param {string} HTML
 *
 * @returns {string|boolean} Returns URL or false
 */
function getNextPageUrl(HTML) {
    // Load in scraped HTML
    let $ = CHEERIO.load(HTML);

    // Find the anchor whose ID starts with `nextPage-`
    // (the id is on the <a> inside <span class="next">, not on the span itself)
    let nextPage = $("a[id^='nextPage-']").first();

    // If length is 0, nothing was found
    if (nextPage.length) {
        // Return href attribute value
        return nextPage.attr("href");
    } else {
        // Nothing found, return false
        return false;
    }
}
/**
 * Scrapes the HTML from pages
 *
 * @function scrapper
 *
 * @param {string} URL
 *
 * @returns {boolean} Returns false when no URL is provided or the request fails
 */
function scrapper(URL) {
    // Check if URL was provided
    if (!URL) {
        return false;
    }

    // Send out request to URL
    REQUEST(URL, function (error, response, body) {
        // Check for errors
        if (!error && response.statusCode == 200) {
            console.log(body); // Show the HTML

            // Recursion: keep going until getNextPageUrl() returns false
            let nextUrl = getNextPageUrl(body);
            scrapper(nextUrl);
        } else {
            return false;
        }
    });
}
// Pass to scrapper function test
//console.log(getNextPageUrl("<span class='next'><a id='nextPage-159c6fa8635' class='page' href='/blablabla'></a></span>"));
// Start the initial scrapping
scrapper("http://google.com");
It's impossible to do this in Node.js: Node.js is server side, not client side, so there is no real click to trigger.
As a solution, you can parse the link's href attribute and make a request to scrape the next page. This is how server-side scrapers usually work.
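As an illustration of pulling the href out of the snippet from the question, here is a minimal sketch. It uses a plain regex, which is brittle for real HTML (cheerio, as used in the other answer, is the robust choice), but it shows the idea:

```javascript
// Minimal illustration only: extract the next-page href from the markup.
// The id pattern "nextPage-" is taken from the question's HTML snippet.
function extractNextHref(html) {
  var match = html.match(/<a[^>]*id=["']nextPage-[^"']*["'][^>]*href=["']([^"']*)["']/);
  return match ? match[1] : false;
}

var snippet = '<span class="next">' +
              '<a id="nextPage-159c6fa8635" class="page" href="/blablabla"></a>' +
              '</span>';

console.log(extractNextHref(snippet)); // "/blablabla"
```

You would then resolve that (usually relative) URL against the current page's URL and request it to continue scraping.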
I would like to scrape the http://www.euromillones.com.es/ website to get the last 5 winning numbers and the two stars. They can be seen in the left column of the website. I've been reading tutorials, but I haven't been able to achieve this.
This is the code I wrote so far:
app.get('/winnernumbers', function (req, res) {
    // Tell request that we want to fetch the site, and send the results to a callback function
    request({ uri: 'http://www.euromillones.com.es/' }, function (err, response, body) {
        var self = this;
        self.items = new Array(); // I feel like I want to save my results in an array
        // Just a basic error check
        if (err || response.statusCode !== 200) { console.log('Request error.'); }
        // Send the body param as the HTML code we will parse in jsdom;
        // also tell jsdom to attach jQuery in the scripts, loaded from jquery.com
        jsdom.env({
            html: body,
            scripts: ['http://code.jquery.com/jquery-1.6.min.js']
        }, function (err, window) {
            // Use jQuery just as in a regular HTML page
            var $ = window.jQuery;
            res.send($('title').text());
        });
    });
});
I am getting the following error:
Must pass a "created", "loaded", "done" option or a callback to jsdom.env.
It looks to me that you've just used a combination of arguments that jsdom does not know how to handle. The documentation shows this signature:
jsdom.env(string, [scripts], [config], callback);
The two middle arguments are optional but you'll note that all possible combinations here start with a string and end with a callback. The documentation mentions one more way to call jsdom.env, and that's by passing a single config argument. What you are doing amounts to:
jsdom.env(config, callback);
which does not correspond to any of the documented methods. I would suggest changing your code to pass a single config argument. You can move your current callback to the done field of that config object. Something like this:
jsdom.env({
    html: body,
    scripts: ['http://code.jquery.com/jquery-1.6.min.js'],
    done: function (err, window) {
        // Use jQuery just as in a regular HTML page
        var $ = window.jQuery;
        res.send($('title').text());
    }
});
I am trying to get all of the links in a subreddit using the API, but it is only returning one url. Here is the code I have:
var request = require('request');
var webpage = 'http://www.reddit.com/r/AmazonUnder5/top.json?limit=100';

// login
request.post('http://www.reddit.com/api/login', { form: { api_type: 'json', passwd: 'password', rem: true, user: 'username' } });

// get urls
request({ uri: webpage, json: true, headers: { 'User-Agent': 'mybot v. 0.0.1' } }, function (error, response, body) {
    if (!error && response.statusCode == 200) {
        for (var key in body.data.children) {
            var url = body.data.children[key].data.url;
            console.log(url);
        }
    }
});
When I visit the json link in my browser, it returns all 100 posts.
That's because only one post exists in the top listing:
http://www.reddit.com/r/AmazonUnder5/top
You could use hot instead
http://www.reddit.com/r/AmazonUnder5/hot.json
Also, you don't need to log in to do public GET requests.
Edit: You are getting so few results because you are not logged in properly.
When logging in, use the
"op" => "login"
parameter and test which cookies and data are returned.
I also recommend using the SSL login URL, since that works for me:
https://ssl.reddit.com/api/login/
Right now, I am doing some simple web scraping, for example get the current train arrival/departure information for one railway station. Here is the example link, http://www.thetrainline.com/Live/arrivals/chester, from this link you can visit the current arrival trains in the chester station.
I am using the node.js request module to do some simple web scraping,
app.get('/railway/arrival', function (req, res, next) {
    console.log("/railway/arrival/ " + req.query["city"]);
    var city = req.query["city"];
    if (city === undefined) {
        console.log("city is undefined");
        city = "liverpool-james-street";
    }
    getRailwayArrival(city, function (err, data) {
        res.send(data);
    });
});

function getRailwayArrival(station, callback) {
    request({
        uri: "http://www.thetrainline.com/Live/arrivals/" + station
    }, function (error, response, body) {
        var $ = cheerio.load(body);
        var a = new Array();
        $(".results-contents li a").each(function () {
            var link = $(this);
            //var href = link.attr("href");
            var due = $(this).find('.due').text().replace(/(\r\n|\n|\r|\t)/gm, "");
            var destination = $(this).find('.destination').text().replace(/(\r\n|\n|\r|\t)/gm, "");
            var on_time = $(this).find('.on-time-yes .on-time').text().replace(/(\r\n|\n|\r|\t)/gm, "");
            // .text() returns "" (not undefined) when nothing matches
            var on_time_no = "";
            if (on_time === "") on_time_no = $(this).find('.on-time-no').text().replace(/(\r\n|\n|\r|\t)/gm, "");
            var platform = $(this).find('.platform').text().replace(/(\r\n|\n|\r|\t)/gm, "");
            var obj = new Object();
            obj.due = due; obj.destination = destination; obj.on_time = on_time; obj.platform = platform;
            a.push(obj);
            console.log("arrival ".green + due + " " + destination + " " + on_time + " " + platform + " " + on_time_no);
        });
        console.log("get station data " + a.length + " " + $(".updated-time").text());
        callback(null, a);
    });
}
The code works and gives me a list of data; however, the data differ from what I see in the browser, even though they come from the same URL. I don't know why that is. Is it because their server can distinguish requests sent by a server from requests sent by a browser, and sends me the wrong data when the request comes from a server? How can I overcome this problem?
Thanks in advance.
They have probably stored a session per click event. That is, when you visit the page for the first time, a session is stored and validated for the next action you perform. Say you select a value from a drop-down list; for that click a new session value is generated, which loads the data for your selected combobox value. When you then click "show list", the previous session value is validated and you get accurate data.
Now, if you don't capture that session value programmatically and pass it as a parameter with your request, you will get the default data, or nothing at all. So the challenge for you is to catch that value. Use Firebug to help.
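To make the idea of catching a session value concrete, here is a small sketch that pulls a named cookie's value out of a Set-Cookie response header so it can be sent back on the next request. The header and cookie name ("SESSIONID") are made up for illustration:

```javascript
// Hypothetical helper: extract a named cookie's value from one
// Set-Cookie header string (cookie name "SESSIONID" is invented here).
function getCookieValue(setCookieHeader, name) {
  var parts = setCookieHeader.split(";")[0].split("=");
  return parts[0].trim() === name ? parts.slice(1).join("=") : null;
}

var header = "SESSIONID=abc123; Path=/; HttpOnly";
console.log(getCookieValue(header, "SESSIONID")); // "abc123"
```

In practice the request module can handle this automatically: pass jar: true (or a cookie jar object) and it will store and resend cookies for you.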
Another issue here could be that the content is generated by JavaScript that would normally run in your browser. jsdom is a module that will execute such scripts and provide that content, but it is not as lightweight as cheerio.
Cheerio does not execute these scripts, and as a result the content may not be visible (as you're experiencing). This is an article I read a while back that led me to the same discovery; just open the article and search for "jsdom is more powerful" for a quick answer:
http://encosia.com/cheerio-faster-windows-friendly-alternative-jsdom/
How can an Omnibox extension create and post form data to a website and then display the result?
Here's an example of what I want to do. When you type lookup bieber into the Omnibox, I want my extension to post form data looking like
searchtype: all
searchterm: bieber
searchcount: 20
to the URL http://lookup.com/search
So that the browser will end up loading http://lookup.com/search with the results of the search.
This would be trivial if I could send the data in a GET, but lookup.com expects an HTTP POST. The only way I can think of is to inject a form into the current page and then submit it, but (a) that only works if there is a current page, and (b) it doesn't seem to work anyway (maybe permissions need to be set).
Before going off down that route, I figured that somebody else must at least have tried to do this before. Have you?
You could do this by using the omnibox api:
chrome.omnibox.onInputChanged.addListener(
    function (text, suggest) {
        // your logic here
    });
Once you have your extension 'activated' due to a certain keyword you typed, you can call something like this:
var q = "searchtype=all&searchterm=bieber&searchcount=20"; // the params you wish to pass
var url = "http://yourSite.com";
var req = new XMLHttpRequest();
req.open("POST", url, true);
req.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
req.onreadystatechange = function () {
    if (req.readyState == 4) {
        callback(req.responseXML);
    }
};
req.send(q);