So I have this Node.js script which scrapes some parts of a webpage:
var cheerio = require('cheerio');
var request = require('request');
var x = 1;
request({
method: 'GET',
url: 'https://balticnews.net/'
}, function(err, response, body) {
if (err) return console.error(err);
var $ = cheerio.load(body);
$('#table, td').eq(x).each(function() {
console.log($(this).text());
});
});
but I need x to change. I tried to make a for loop but nothing changed. I need the program, when I run it, to show me the result for x=1, then 1+5, then 6+5, and so on; it's hard to explain :D Of course I could just copy and paste this lots of times and choose the numbers I need:
$('#table, td').eq(x).each(function() {
console.log($(this).text());
});
but I want to learn how to do it faster.
So I understand you want just the indexes 1, 6, 11, ... A solution could be:
// Not tested
$('#table, td').each(function(index, element) {
    if (index % 5 == 1) {
        console.log($(element).text());
    }
});
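If you would rather keep the .eq() style from the question, a plain for loop over the matched set should also work. This is just an untested sketch reusing the same $ and selector from the snippet above:
var cells = $('#table, td');
// start at index 1 and jump 5 indexes each time: 1, 6, 11, ...
for (var x = 1; x < cells.length; x += 5) {
    console.log(cells.eq(x).text());
}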
What about a more generic solution for a more complex case?
(I'm working with XML, but the same comment would apply to HTML inputs.)
matchSeatIndex = $("mySeatList")
.not(':has(SEAT > LIST_CHARACTERISTIC:contains("1"))')
.has("SEAT > LIST_CHARACTERISTIC:contains('1W')")
.has("SEAT > STATUS:contains('AVAILABLE')")
.find('INDEX').first().text();
The problem here is that the first filter (on the characteristic containing "1") will also filter out the "1W" seats.
In such a case, it is painful to write it in two parts:
matchSeatIndex = $("mySeatList")
.has("SEAT > LIST_CHARACTERISTIC:contains('1W')")
.has("SEAT > STATUS:contains('AVAILABLE')")
.find('INDEX').first().text();
// Then a second part to check with a function whether the characteristic '1' is actually present
Not sure why cheerio hasn't implemented :eq() there... it sounds basic.
Is there any trick that does the job?
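For illustration, one possible way to do that "check with a function" in a single chain is cheerio's .filter() with a callback that compares the characteristic text exactly instead of relying on :contains. This is a rough, untested sketch against the XML structure implied above:
matchSeatIndex = $("mySeatList")
    .has("SEAT > STATUS:contains('AVAILABLE')")
    .filter(function() {
        // collect the exact characteristic values of this seat entry
        var chars = $(this).find('SEAT > LIST_CHARACTERISTIC').map(function() {
            return $(this).text().trim();
        }).get();
        // keep it only if '1W' is present and a plain '1' is not
        return chars.indexOf('1W') !== -1 && chars.indexOf('1') === -1;
    })
    .find('INDEX').first().text();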
I'm building a Node-RED node for integrating with an older ventilation system, using screen scraping with Node.js and cheerio. It works fine for fetching some values now, but I seem unable to fetch the right element in the more complex structure that tells which operating mode is active. A screenshot of the structure is attached. And yes, I've never used jQuery and am quite a newbie with cheerio.
I have managed, in a way too complex manner, to get the value if it is within a certain part of the tree.
const msgResult = scraped('.control-1');
const activeMode = msgResult.get(0).children.find(x => x.attribs['data-selected'] === '1').attribs['id'];
But this only works on the first match, and fails if the data-selected === '1' element isn't in that part of the tree. I thought I should be able to use just .find from the top of the tree, but I get no matches.
const activeMode = scraped('.control-1').find(x => x.attribs['data-selected'] === '1')
What I would like to get from the attached HTML structure is the ID of the div that has data-selected=1, which can be below either of the two divs with class control-1. Ideally also the content of the underlying span, where the mode is described in text.
HTML structure
It's hard to tell what you're looking for but maybe:
$('.control-1 [data-selected="1"]').attr('id')
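If you also need the text of the underlying span, a small extension of that selector could look like this (assuming the span is a descendant of the selected div, which I'm only inferring from the description):
const active = scraped('.control-1 [data-selected="1"]').first();
const activeModeId = active.attr('id');
const activeModeText = active.find('span').first().text(); // description of the mode, if present
console.log(activeModeId, activeModeText);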
You should loop over every matching subtree to check each one.
Try this code; hope this works.
const cheerio = require('cheerio');
const fsextra = require('fs-extra');
(async () => {
    try {
        // fs-extra returns a promise when no callback is given
        const contentHTML = await fsextra.readFile('untitled.html', 'utf-8');
        const $ = cheerio.load(contentHTML);
        const selector = $('.control-1 [data-selected="1"]');
        for (let num = 0; num < selector.length; num++) {
            console.log(selector[num].attribs.id);
        }
    } catch (error) {
        console.log('ERROR: ', error);
    }
})();
So I'm building a simple wrapper around an API to fetch all results for a particular entity. The API method can only return up to 500 results at a time, but it's possible to retrieve all results using the skip parameter, which specifies what index to start retrieving results from. The API also has a method which returns the total number of results that exist.
I've spent some time battling with the request package, trying to come up with a way to concatenate all the results in order and then execute a callback that passes all the results through.
This is my code currently:
Donedone.prototype.getAllActiveIssues = function(callback){
var url = this.url;
request(url + `/issues/all_active.json?take=500`, function (error, response, body) {
if (!error && response.statusCode == 200) {
var data = JSON.parse(body);
var totalIssues = data.total_issues;
var issues = [];
for (let i=0; i < totalIssues; i+=500){
request(url + `/issues/all_active.json?skip=${i}&take=500`, function (error, response, body){
if (!error && response.statusCode == 200) {
console.log(JSON.parse(body).issues.length);
issues.concat(JSON.parse(body).issues);
console.log(issues); // returns [] on all occasions
//callback(issues);
} else{
console.log("AGHR");
}
});
}
} else {
console.log("ERROR IN GET ALL ACTIVE ISSUES");
}
});
};
So I'm starting off with an empty array, issues. I iterate through a for loop, each time increasing i by 500 and passing that as the skip param. As you can see, I'm logging how many issues each response contains before concatenating them onto the main issues variable.
The output, from a total of 869 results, is this:
369
[]
500
[]
Why is my issues variable empty when I log it out? There are clearly results to concatenate with it.
A more general question: is this approach the best way to go about what I'm trying to achieve? I figured that even if my code did work, the nature of asynchronicity means it's entirely possible for the results to be concatenated in the wrong order.
Should I just use a synchronous request library?
Why is my issues variable empty when I log it out? There are clearly
results to concatenate with it.
A main problem here is that .concat() returns a new array. It doesn't add items onto the existing array.
You can change this:
issues.concat(JSON.parse(body).issues);
to this:
issues = issues.concat(JSON.parse(body).issues);
to make sure you are retaining the new concatenated array. This is a very common mistake.
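A tiny stand-alone illustration of the difference (plain JavaScript, nothing specific to your API):
var issues = [];
issues.concat([1, 2, 3]);          // returns a new array; `issues` is still []
console.log(issues);               // []
issues = issues.concat([1, 2, 3]);          // reassign to keep the result
Array.prototype.push.apply(issues, [4, 5]); // or push onto the existing array in place
console.log(issues);               // [1, 2, 3, 4, 5]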
You also potentially have sequencing issues in your array: the for loop starts a whole bunch of requests at the same time, and the results may or may not arrive back in the proper order. You will still get the proper total number of issues, but they may not be in the order requested. I don't know whether that is a problem for you or not; if it is, we can also suggest a fix for that.
A more general question: is this approach the best way to go about
what I'm trying to achieve? I figured that even if my code did work,
the nature of asynchronicity means it's entirely possible for the
results to be concatenated in the wrong order.
Except for the ordering issue which can also be fixed, this is a reasonable way to do things. We would have to know more about your API to know if this is the most efficient way to use the API to get your results. Usually, you want to avoid making N repeated API calls to the same server and you'd rather make one API call to get all the results.
Should I just use a synchronous request library?
Absolutely not. Node.js requires learning how to do asynchronous programming. It is a learning step for most people, but it is how you get the best performance from Node.js, and it should be learned and used.
Here's a way to collect all the results in reliable order using promises for synchronization and error propagation (which is hugely useful for async processing in node.js):
// promisify the request() function so it returns a promise
// whose fulfilled value is the request result
function requestP(url) {
return new Promise(function(resolve, reject) {
request(url, function(err, response, body) {
if (err || response.statusCode !== 200) {
reject({err: err, response: response});
} else {
resolve({response: response, body: body});
}
});
});
}
Donedone.prototype.getAllActiveIssues = function() {
var url = this.url;
return requestP(url + `/issues/all_active.json?take=500`).then(function(results) {
var data = JSON.parse(results.body);
var totalIssues = data.total_issues;
var promises = [];
for (let i = 0; i < totalIssues; i+= 500) {
promises.push(requestP(url + `/issues/all_active.json?skip=${i}&take=500`).then(function(results) {
return JSON.parse(results.body).issues;
}));
}
return Promise.all(promises).then(function(results) {
// results is an array of each chunk (which is itself an array) so we have an array of arrays
// now concat all results in order
return Array.prototype.concat.apply([], results);
})
});
}
xxx.getAllActiveIssues().then(function(issues) {
// process issues here
}, function(err) {
// process error here
})
I am a total scrub with the node http module and having some trouble.
The ultimate goal here is to take a huge list of urls, figure out which are valid and then scrape those pages for certain data. So step one is figuring out if a URL is valid and this simple exercise is baffling me.
say we have an array allURLs:
["www.yahoo.com", "www.stackoverflow.com", "www.sdfhksdjfksjdhg.net"]
The goal is to iterate over this array, make a GET request to each URL, and if a response comes in, add the link to a list of workingURLs (for now just another array); otherwise it goes to a list of brokenURLs.
var workingURLs = [];
var brokenURLs = [];
for (var i = 0; i < allURLs.length; i++) {
var url = allURLs[i];
var req = http.get(url, function (res) {
if (res) {
workingURLs.push(?????); // How to derive URL from response?
}
});
req.on('error', function (e) {
brokenURLs.push(e.host);
});
}
What I don't know is how to properly obtain the URL from the request/response object itself, or really how to structure this kind of async code - because again, I am a Node.js scrub :(
For most websites res.headers.location works, but there are times when the headers do not have this property, and that will cause problems for me later on. I've also tried console-logging the response object itself, and that was a messy and fruitless endeavor.
I have tried pushing the url variable to workingURLs, but by the time any response comes back that would trigger the push, the for loop is already over and url is forever pointing to the final element of the allURLs array.
Thanks to anyone who can help
You need to close over the url value to have access to it and protect it from changes on the next loop iteration.
For example:
(function(url){
// use url here
})(allURLs[i]);
The simplest solution for this is to use forEach instead of for.
allURLs.forEach(function(url){
//....
});
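Applied to the code from the question, that would look roughly like this (untested sketch, keeping your workingURLs and brokenURLs arrays):
allURLs.forEach(function (url) {
    http.get(url, function (res) {
        // `url` is captured per iteration, so it is safe to push it here
        workingURLs.push(url);
    }).on('error', function (e) {
        brokenURLs.push(url);
    });
});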
A promisified solution lets you know the moment when all the work is done:
var http = require('http');
var allURLs = [
"http://www.yahoo.com/",
"http://www.stackoverflow.com/",
"http://www.sdfhksdjfksjdhg.net/"
];
var workingURLs = [];
var brokenURLs = [];
var promises = allURLs.map(url => validateUrl(url)
.then(res => (res?workingURLs:brokenURLs).push(url)));
Promise.all(promises).then(() => {
console.log(workingURLs, brokenURLs);
});
// ----
function validateUrl(url) {
return new Promise((ok, fail) => {
http.get(url, res => ok(res.statusCode == 200))
.on('error', e => ok(false));
});
}
// Prevent Node.js from exiting; not needed if any server is listening.
var t = setTimeout(() => { console.log('Time is over'); }, 1000).ref();
You can use something like this (Not tested):
const arr = ["", "/a", "", ""];
Promise.all(arr.map(url => fetch(url)))
    .then(responses => responses.filter(res => res.ok).map(res => res.url))
    .then(workingUrls => {
        console.log(workingUrls);
        console.log(arr.filter(url => workingUrls.indexOf(url) == -1));
    });
EDITED
Working fiddle. (Note that you can't make a request to another site from the browser because of cross-domain restrictions.)
UPDATED with @vp_arth's suggestions
const arr = ["/", "/a", "/", "/"];
let working=[], notWorking=[],
find = url=> fetch(url)
.then(res=> res.ok ?
working.push(res.url) && res : notWorking.push(res.url) && res);
Promise.all(arr.map(find))
.then(responses=>{
console.log('working', working, 'notWorking', notWorking);
/* Do whatever with the responses if needed */
});
Fiddle
I am trying to scrape data from a page using cheerio and request in the following way:
1) go to url 1a (http://example.com/0)
2) extract url 1b (http://example2.com/52)
3) go to url 1b
4) extract some data and save
5) go to url 1a+1 (http://example.com/1, let's call it 2a)
6) extract url 2b (http://example2.com/693)
7) go to url 2b
8) extract some data and save etc...
I am struggling to work out how to do this (note: I am only familiar with Node.js and cheerio/request for this task, even though it is likely not elegant, so I am not looking for alternative libraries or languages to do this in, sorry). I think I am missing something, because I can't even think how this could work.
EDIT
Let me try this another way. Here is the first part of the code:
var request = require('request'),
cheerio = require('cheerio');
request('http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1&s=0', function(error, response, html) {
if (!error && response.statusCode == 200) {
var $ = cheerio.load(html, {
xmlMode: true
});
var id = ($('work').attr('id'))
var total = ($('records').attr('total'))
}
});
The first returned page looks like this
<response>
<query>date:[2000 TO 2014]</query>
<zone name="book">
<records s="0" n="1" total="69977" next="/result?l-advformat=Thesis&sortby=dateDesc&q=+date%3A%5B2000+TO+2014%5D&l-availability=y&l-australian=y&n=1&zone=book&s=1">
<work id="189231549" url="/work/189231549">
<troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl>
<title>
Design of physiological control and magnetic levitation systems for a total artificial heart
</title>
<contributor>Greatrex, Nicholas Anthony</contributor>
<issued>2014</issued>
<type>Thesis</type>
<holdingsCount>1</holdingsCount>
<versionCount>1</versionCount>
<relevance score="0.001961126">vaguely relevant</relevance>
<identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier>
</work>
</records>
</zone>
</response>
The URL above needs to be requested repeatedly, incrementing s each time (s=0, s=1, etc.), for 'total' number of times.
'id' needs to be fed into the url below in a second request:
request('http://api.trove.nla.gov.au/work/' + id + '?key=6k6oagt6ott4ohno&reclevel=full', function(error, response, html) {
if (!error && response.statusCode == 200) {
var $ = cheerio.load(html, {
xmlMode: true
});
//extract data here etc.
}
});
For example, when using the id="189231549" returned by the first request, the second returned page looks like this:
<work id="189231549" url="/work/189231549">
<troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl>
<title>
Design of physiological control and magnetic levitation systems for a total artificial heart
</title>
<contributor>Greatrex, Nicholas Anthony</contributor>
<issued>2014</issued>
<type>Thesis</type>
<subject>Total Artificial Heart</subject>
<subject>Magnetic Levitation</subject>
<subject>Physiological Control</subject>
<abstract>
Total Artificial Hearts are mechanical pumps which can be used to replace the failing natural heart. This novel study developed a means of controlling a new design of pump to reproduce physiological flow bringing closer the realisation of a practical artificial heart. Using a mathematical model of the device, an optimisation algorithm was used to determine the best configuration for the magnetic levitation system of the pump. The prototype device was constructed and tested in a mock circulation loop. A physiological controller was designed to replicate the Frank-Starling like balancing behaviour of the natural heart. The device and controller provided sufficient support for a human patient while also demonstrating good response to various physiological conditions and events. This novel work brings the design of a practical artificial heart closer to realisation.
</abstract>
<language>English</language>
<holdingsCount>1</holdingsCount>
<versionCount>1</versionCount>
<tagCount>0</tagCount>
<commentCount>0</commentCount>
<listCount>0</listCount>
<identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier>
</work>
So my question now is: how do I tie these two parts (loops) together to achieve the result (downloading and parsing about 70,000 pages)?
I have no idea how to code this in JavaScript for Node.js. I am new to JavaScript.
You can find out how to do it by studying existing, well-known website copiers (closed source or open source).
For example, use a trial copy of http://www.tenmax.com/teleport/pro/home.htm to scrape your pages, and then try the same with http://www.httrack.com; you should get a clear idea of how they did it (and how you can do it).
The key programming concepts are a lookup cache and a task queue.
Recursion is not the right concept here if your solution needs to scale well up to several Node.js worker processes and up to many pages.
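To illustrate those two concepts, here is a minimal, untested sketch of a crawl loop built around a visited-URL cache and a task queue (the start URL is a placeholder and relative-URL resolution is left out):
var request = require('request');
var cheerio = require('cheerio');
var queue = ['http://example.com/'];   // task queue of URLs still to fetch
var visited = {};                      // lookup cache of URLs already processed
function crawlNext() {
    if (queue.length === 0) return console.log('done');
    var url = queue.shift();
    if (visited[url]) return crawlNext();
    visited[url] = true;
    request(url, function (error, response, html) {
        if (!error && response.statusCode == 200) {
            var $ = cheerio.load(html);
            // extract data here, then enqueue any unseen absolute links
            $('a[href^="http"]').each(function () {
                var link = $(this).attr('href');
                if (!visited[link]) queue.push(link);
            });
        }
        crawlNext();  // move on to the next queued URL
    });
}
crawlNext();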
EDIT: after clarifying comments
Before you start reworking your scraping engine into a more scalable architecture, as a new Node.js developer you can start simply with a synchronized alternative to Node.js callback hell, as provided by the wait.for package created by @lucio-m-tato.
The code below worked for me with the links you provided
var request = require('request');
var cheerio = require('cheerio');
var wait = require("wait.for");
function requestWaitForWrapper(url, callback) {
request(url, function(error, response, html) {
if (error)
callback(error, response);
else if (response.statusCode == 200)
callback(null, html);
else
callback(new Error("Status not 200 OK"), response);
});
}
function readBookInfo(baseUrl, s) {
var html = wait.for(requestWaitForWrapper, baseUrl + '&s=' + s.toString());
var $ = cheerio.load(html, {
xmlMode: true
});
return {
s: s,
id: $('work').attr('id'),
total: parseInt($('records').attr('total'))
};
}
function readWorkInfo(id) {
var html = wait.for(requestWaitForWrapper, 'http://api.trove.nla.gov.au/work/' + id.toString() + '?key=6k6oagt6ott4ohno&reclevel=full');
var $ = cheerio.load(html, {
xmlMode: true
});
return {
title: $('title').text(),
contributor: $('contributor').text()
}
}
function main() {
var baseBookUrl = 'http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1';
var baseInfo = readBookInfo(baseBookUrl, 0);
for (var s = 0; s < baseInfo.total; s++) {
var bookInfo = readBookInfo(baseBookUrl, s);
var workInfo = readWorkInfo(bookInfo.id);
console.log(bookInfo.id + ";" + workInfo.contributor + ";" + workInfo.title);
}
}
wait.launchFiber(main);
You could use the additional async module to handle multiple requests and iterate through several pages. Read more about async here: https://github.com/caolan/async.
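For illustration, here is a rough, untested sketch of that idea with async.eachLimit; the concurrency limit of 5 and the error handling are assumptions on my part, not part of the original answer:
var async = require('async');
var request = require('request');
var cheerio = require('cheerio');
var apiKey = '6k6oagt6ott4ohno';
var baseBookUrl = 'http://api.trove.nla.gov.au/result?key=' + apiKey +
    '&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]' +
    '&l-availability=y&l-australian=y&n=1';
// fetch one result page (s = 0, 1, 2, ...), pull out the work id,
// then fetch the full work record and log a few fields
function processPage(s, done) {
    request(baseBookUrl + '&s=' + s, function (error, response, xml) {
        if (error || response.statusCode != 200) return done(error || new Error('bad status'));
        var $ = cheerio.load(xml, { xmlMode: true });
        var id = $('work').attr('id');
        request('http://api.trove.nla.gov.au/work/' + id + '?key=' + apiKey + '&reclevel=full',
            function (error, response, workXml) {
                if (error || response.statusCode != 200) return done(error || new Error('bad status'));
                var $work = cheerio.load(workXml, { xmlMode: true });
                console.log(id + ';' + $work('contributor').text() + ';' + $work('title').text());
                done();
            });
    });
}
// read the total from the first page, then walk all offsets,
// keeping at most 5 requests in flight at a time
request(baseBookUrl + '&s=0', function (error, response, xml) {
    if (error || response.statusCode != 200) return console.error(error || response.statusCode);
    var total = parseInt(cheerio.load(xml, { xmlMode: true })('records').attr('total'), 10);
    var offsets = [];
    for (var s = 0; s < total; s++) offsets.push(s);
    async.eachLimit(offsets, 5, processPage, function (err) {
        if (err) console.error(err);
        else console.log('all pages processed');
    });
});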
Complete Node.js noob, so don't judge me...
I have a simple requirement. Crawl a web site, find all the product pages, and save some data from the product pages.
Easier said than done.
Looking at Node.js samples, I can't find anything similar.
There's a request-based scraper:
request({uri:'http://www.google.com'}, function (error, response, body) {
if (!error && response.statusCode == 200) {
var window = jsdom.jsdom(body).createWindow();
jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
// jQuery is now loaded on the jsdom window created from 'body'
jQuery('.someClass').each(function () { /* Your custom logic */ });
});
}
});
But I can't figure out how to make it call itself once it scrapes the root page, or how to populate an array of URLs that it needs to scrape.
Then there's the http agent way:
var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);
agent.addListener('next', function (err, agent) {
var window = jsdom.jsdom(agent.body).createWindow();
jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
// jQuery is now loaded on the jsdom window created from 'agent.body'
jquery('.someClass').each(function () { /* Your Custom Logic */ });
agent.next();
});
});
agent.addListener('stop', function (agent) {
sys.puts('the agent has stopped');
});
agent.start();
This takes an array of locations, but then again, once you get it started with an array, you can't add more locations to it to go through all the product pages.
And I can't even get Apricot working; for some reason I'm getting an error.
So, how do I modify any of the above examples (or anything not listed above) to scrape a site, find all the product pages, find some data in there (the jQuery .someClass example should do the trick), and then save that to a db?
Thanks!
Personally, I use Node IO to scrape some websites. https://github.com/chriso/node.io
More details about scraping can be found in the wiki!
I've had pretty good success crawling and scraping with CasperJS. It's a pretty nice library built on top of PhantomJS. I like it because it's fairly succinct. Callbacks can be executed as foo.then(), which is super simple to understand, and I can even use jQuery since PhantomJS is an implementation of WebKit. For example, the following would instantiate an instance of Casper and push all links on an archive page to an array called 'links'.
var casper = require("casper").create();
var numberOfLinks = 0;
var currentLink = 0;
var links = [];
var buildPage, capture, selectLink, grabContent, writeContent;
casper.start("http://www.yoursitehere.com/page_to/scrape/", function() {
numberOfLinks = this.evaluate(function() {
return __utils__.findAll('.nav-selector a').length;
});
this.echo(numberOfLinks + " items found");
// cause jquery makes it easier
casper.page.injectJs('/PATH/TO/jquery.js');
});
// Capture links
capture = function() {
links = this.evaluate(function() {
var link = [];
jQuery('.nav-selector a').each(function() {
link.push($(this).attr('href'));
});
return link;
});
this.then(selectLink);
};
You can then use node fs (or whatever else you want, really) to push your data into XML, CSV, or whatever you want. The example for scraping BBC photos was exceptionally helpful when I built my scraper.
This is a view from 10,000 feet of what casper can do. It has a very potent and broad API. I dig it, in case you couldn't tell :).
My full scraping example is here: https://gist.github.com/imjared/5201405.