I am trying to take a value from the original URL, insert it into a second URL, visit that URL, and then read a value from it. I have two problems:
Problem 1: the 'myurl' variable ends up with the value
"http://api.trove.nla.gov.au/work/undefined?key=6k6oagt6ott4ohno&reclevel=full"
That is, it is not taking the 'myid' variable.
Problem 2: How do I then follow the 'myurl' url as I want to access the DOM? Do I make another 'request' for 'myurl'?
Here is my code so far:
var request = require('request'),
cheerio = require('cheerio');
request('http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=0&s=0', function(error, response, html) {
if (!error && response.statusCode == 200) {
var $ = cheerio.load(html, {
xmlMode: true
});
var myid = ($('work').attr('id'))
var myurl = "http://api.trove.nla.gov.au/work/" +(myid)+ "?key=6k6oagt6ott4ohno&reclevel=full"
console.log(myurl)
}
});
Problem 1: Your selector is probably wrong. Did you mean '#work' or '.work', perhaps?
Problem 2: Yes, you'll have to make another request to myurl using request() (since that's what you're using).
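A minimal sketch of both points, assuming the first response really does contain a `work` element with an `id` attribute. The `undefined` in the URL suggests the selector matched nothing, so the sketch refuses to build the second URL in that case. `buildWorkUrl` is a hypothetical helper name and `YOUR_KEY` is a placeholder; the nested `request` call is shown as a comment so the helper can be run without network access or extra packages.

```javascript
// Hypothetical helper: build the second URL, refusing to continue when the id is missing.
function buildWorkUrl(id, key) {
  if (id === undefined || id === null || id === '') {
    throw new Error('No work id found; check the selector against the returned XML');
  }
  return 'http://api.trove.nla.gov.au/work/' + encodeURIComponent(id) +
    '?key=' + encodeURIComponent(key) + '&reclevel=full';
}

// Usage inside the first callback (needs the request and cheerio packages):
//
//   var myid = $('work').first().attr('id');
//   var myurl = buildWorkUrl(myid, 'YOUR_KEY');
//   request(myurl, function (error, response, xml) {
//     if (!error && response.statusCode == 200) {
//       var $$ = cheerio.load(xml, { xmlMode: true });
//       // query the second document here
//     }
//   });

console.log(buildWorkUrl('12345', 'YOUR_KEY'));
```

Failing early on a missing id makes the selector problem visible at the point where it happens, instead of as a confusing response from the second request.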
Related
var request = require('request'),
    cheerio = require('cheerio');
var address = 'http://gallog.dcinside.com/inc/_mylog.php?gid=chermy018&oneview=Y&cid=59&page=';
var i;
for (i = 1; i < 908; i++) {
  var options = {
    // append the page number itself, not the string 'i'
    url: address + i
  };
  request(options, function (err, response, body) {
    console.log(body);
  });
}
error
I chose a parsing approach that uses a cookie. If I run this code for one page, it works well, but for multiple pages it returns a sequence of weird symbols or letters like the above. What causes this, and how can I correct it? Also, because I have only searched for and modified Node.js parsing code, I don't know much about cheerio. I want to parse the text, but I don't know how to select the part under an 'id'.
I need to make a simple web scraper to grab some basic info about the Athens Stock Exchange in real time. My weapon of choice is Node.js and more specifically the 'cheerio' module.
The info I want to grab is represented in the website as the text inside some elements. These elements are nested inside another one. An example is this:
<span id="tickerGeneralIndex" class="style3red">
<span class="percentagedelta">
-0,50%
</span>
</span>
In this case, the data I want to extract is '-0,50%'.
The code I have written is this:
var request = require('request'),
cheerio = require('cheerio');
request('http://www.euro2day.gr/AseRealTime.aspx', function (error, response, html) {
if (!error && response.statusCode == 200) {
var $ = cheerio.load(html);
var span = $('span.percentagedelta').text();
console.log(span);
}
});
This code does not produce the desired output. When run it logs a single empty line in the console.
I have tried to modify my code like this for testing purposes:
var request = require('request'),
cheerio = require('cheerio');
request('http://www.euro2day.gr/AseRealTime.aspx', function (error, response, html) {
if (!error && response.statusCode == 200) {
var $ = cheerio.load(html);
var span = $('span.percentagedelta').attr('class');
console.log(span);
}
});
This way I get 'percentagedelta' in the console. This is correct, as I have asked to get the class of the element. Of course this is not what I wanted. I merely did this to find out if the 'span' variable is loaded correctly.
I am beginning to suspect this has something to do with the characters in the text. Is it possible that some encoding issue is to blame? And if yes, how can I fix that?
The original HTML of http://www.euro2day.gr/AseRealTime.aspx has no data in 'percentagedelta'.
You can check this by looking through your html variable.
The data is filled in afterwards by JavaScript running on the page:
$("#tickerGeneralIndex .percentagedelta").html(data.percentageDelta);
It may be simpler to fetch http://www.euro2day.gr/handlers/data.ashx?type=3, which the page loads via AJAX.
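A rough sketch of fetching that handler directly. It assumes the handler answers with JSON carrying a `percentageDelta` field, which is only a guess based on the page script quoted above; if the format differs, inspect the raw body instead. `extractDelta` and `fetchDelta` are hypothetical names, and Node's built-in `http` module is used here in place of `request` so the sketch needs no extra packages.

```javascript
var http = require('http');

// Assumption: the body is JSON with a percentageDelta field.
// Returns null when the body is not JSON or the field is absent.
function extractDelta(body) {
  try {
    var data = JSON.parse(body);
    return data && data.percentageDelta !== undefined ? data.percentageDelta : null;
  } catch (e) {
    return null; // not JSON; look at the raw body instead
  }
}

// Fetch the handler and hand the extracted value (or null) plus the raw body to the callback.
function fetchDelta(url, callback) {
  http.get(url, function (res) {
    var body = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { callback(null, extractDelta(body), body); });
  }).on('error', callback);
}

// Usage:
//   fetchDelta('http://www.euro2day.gr/handlers/data.ashx?type=3', function (err, delta, raw) {
//     console.log(err || delta || raw);
//   });
```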
I'm experimenting with Node.js and web scraping. In this case, I'm trying to scrape the most recent songs from a local radio station for display. With this particular website, body returns nothing. When I try using google or any other website, body has a value.
Is this a feature of the website I'm trying to scrape?
Here's my code:
var request = require('request');
var url = "http://www.radiomilwaukee.org";
request(url, function(err,resp,body) {
if (!err && resp.statusCode == 200) {
console.log(body);
}
else
{
console.log(err);
}
});
That's weird, the website you're requesting doesn't seem to return anything unless the accept-encoding header is set to gzip. With that in mind, using this gist will work: https://gist.github.com/nickfishman/5515364
I ran the code within that gist, replacing the URL with "http://www.radiomilwaukee.org", and saw the content in the sample.html file once the code had completed.
If you'd rather have access to the web page's content within the code, you could do something like this:
// ...
req.on('response', function(res) {
  // start from an empty string so the result is not prefixed with "undefined"
  var body = '', encoding, unzipped;
  if (res.statusCode !== 200) throw new Error('Status not 200');
  encoding = res.headers['content-encoding'];
  if (encoding == 'gzip') {
    unzipped = res.pipe(zlib.createGunzip());
    unzipped.on("readable", function() {
      // collect the content in the body variable; read() returns null when the buffer is empty
      var chunk;
      while ((chunk = unzipped.read()) !== null) {
        body += chunk.toString();
      }
    });
  }
// ...
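If the whole response is buffered first, the decompression step can be sketched on its own with Node's built-in `zlib` module. `decodeBody` is a hypothetical helper, and UTF-8 text is an assumption about the page. Depending on your version of the request package, passing `gzip: true` in the options may do the same job for you by setting the header and decoding the response.

```javascript
var zlib = require('zlib');

// Decode a fully buffered response body according to its content-encoding header.
function decodeBody(buffer, encoding) {
  if (encoding === 'gzip') {
    return zlib.gunzipSync(buffer).toString('utf8');
  }
  return buffer.toString('utf8');
}

// Round trip with locally compressed data, standing in for a gzipped HTTP response.
var compressed = zlib.gzipSync(Buffer.from('<html>hello</html>'));
console.log(decodeBody(compressed, 'gzip'));
```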
I am trying to get all of the links in a subreddit using the API, but it is only returning one url. Here is the code I have:
var request = require('request');
webpage = 'http://www.reddit.com/r/AmazonUnder5/top.json?limit=100';
//login
request.post('http://www.reddit.com/api/login',{form:{api_type:'json', passwd:'password', rem:true, user:'username'}});
//get urls
request({uri : webpage, json:true, headers:{useragent: 'mybot v. 0.0.1'}}, function(error, response, body) {
if(!error && response.statusCode == 200) {
for(var key in body.data.children) {
var url = body.data.children[key].data.url;
console.log(url);
}
}
});
When I visit the json link in my browser, it returns all 100 posts.
That's because only one post exists in top:
http://www.reddit.com/r/AmazonUnder5/top
You could use hot instead
http://www.reddit.com/r/AmazonUnder5/hot.json
Also, you don't need to log in to do public get requests
Edit: You are getting so few results because you are not logged in properly.
When logging in, include the
"op" => "login"
parameter and check which cookies and data are returned.
I also recommend using the SSL login URL, since that works for me:
https://ssl.reddit.com/api/login/
I'm trying to use the Node.js packages request and jsdom to scrape web pages, and I want to know how I can submit forms and get their responses. I'm not sure if this is possible with jsdom or another module, but I do know that request supports cookies.
The following code demonstrates how I'm using jsdom (along with request and jQuery) to retrieve and parse a web page (in this case, the Wikipedia home page). (Note that this code is adapted from the jquery-request.js code from this tutorial http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs)
var request = require('request'),
jsdom = require('jsdom'),
url = 'http://www.wikipedia.org';
request({ uri:url }, function (error, response, body) {
if (error || response.statusCode !== 200) {
console.log('Error when contacting '+url);
return;
}
jsdom.env({
html: body,
scripts: [
'http://code.jquery.com/jquery-1.5.min.js'
]
}, function (err, window) {
var $ = window.jQuery,
// jQuery is now loaded on the jsdom window created from 'agent.body'
$searchform = $('#searchform'); //search form jQuery object
$('#searchInput').val('Wood');
console.log('form HTML is ' + $searchform.html(),
'search value is ' + $('#searchInput').val()
);
//how I'd like to submit the search form
$('#searchform .searchButton').click();
});
});
The above code prints the HTML from Wikipedia's search form, then "Wood", the value I set the searchInput field to contain. Of course, here the click() method doesn't really do anything, because jQuery isn't operating in a browser; I don't even know if jsdom supports any kind of event handling.
Is there any module that can help me to interact with web pages in this way, or in a similar non-jQuery way? Can this be done in jsdom?
Thanks in advance!
If you don't want to handle the POST request yourself as in the other answer, you can use an alternative to jsdom that supports more of what a browser does:
http://www.phantomjs.org/
I'm not familiar with a nodejs library that will let you get a fully interactive client-side view of a web-page, but you can get the results of a form submission without too much worry.
HTML forms are essentially just a way of sending HTTP requests to a specific URL (which can be found as the action attribute of the form tag). With access to the DOM, you can just pull out these values and create your own request for the specified URL.
Something like this, used as the callback from requesting the Wikipedia home page, will get you the result of searching for "keyboard cat" in English:
var $ = window.jQuery;
var search_term = "keyboard cat";
var search_term_safe = encodeURIComponent(search_term).replace(/%20/g, "+");
var lang = "en";
var lang_safe = encodeURIComponent(lang).replace(/%20/g, "+");
var search_submit_url = $("#searchform").attr("action");
var search_input_name = $("#searchInput").attr("name");
var search_language_name = $("#language").attr("name");
var search_string = search_input_name + "=" + search_term_safe + "&" + search_language_name + "=" + lang_safe;
// Note the wikipedia specific hack by prepending "http:".
var full_search_uri = "http:" + search_submit_url + "?" + search_string;
request({ uri: full_search_uri }, function(error, response) {
if (error || response.statusCode != 200) {
console.log("Got an error from the search page: " + error);
} else {
// Do some stuff with the response page here.
}
});
Basically the important stuff is:
"Submitting a search" really just means sending either an HTTP GET or POST request to the URL specified in the action attribute of the form tag.
Create the string to use for form submission using the name attributes of each of the form's input tags, combined with the value that they are actually submitting, in this format: name1=value1&name2=value2
For GET requests, just append that string to the URL as a query string (URL?query-string)
For POST requests, post that string as the body of the request.
Note that the string used for form submission must be escaped and have spaces represented as +.
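The points above can be sketched as a small helper. `encodeForm` and `encodePart` are hypothetical names, and the field names `search` and `language` are assumptions for illustration; in practice they come from the `name` attributes of the form's inputs.

```javascript
// Hypothetical helper: turn { name: value } pairs into name1=value1&name2=value2,
// escaping each part and writing spaces as "+".
function encodeForm(fields) {
  return Object.keys(fields).map(function (name) {
    return encodePart(name) + '=' + encodePart(fields[name]);
  }).join('&');
}

function encodePart(value) {
  // The regex with the g flag replaces every %20, not just the first one.
  return encodeURIComponent(String(value)).replace(/%20/g, '+');
}

var query = encodeForm({ search: 'keyboard cat video', language: 'en' });
console.log(query); // search=keyboard+cat+video&language=en

// GET:  request({ uri: actionUrl + '?' + query }, callback);
// POST: request.post({ uri: actionUrl, form: { search: 'keyboard cat video', language: 'en' } }, callback);
```

Here `actionUrl` stands for the form's action attribute. For POST, the `form` option of request (already used in the reddit login example earlier on this page) builds and sends the encoded body for you.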