I'm trying to write a scraper using 'request' and 'cheerio'. I have an array of 100 urls. I'm looping over the array, using 'request' on each url, and then doing cheerio.load(body). I'm limiting the loop to i < 3 for testing; if I raise that limit above 3, the scraper breaks because productNumber is undefined and I can't call split on an undefined value. I think the for loop is moving on before the webpage responds and the body has had time to be loaded by cheerio, and this question: nodeJS - Using a callback function with Cheerio would seem to agree.
My problem is that I don't understand how I can make sure the webpage has 'loaded' or been parsed in each iteration of the loop so that I don't get any undefined variables. According to the other answer I don't need a callback, but then how do I do it?
for (var i = 0; i < productLinks.length; i++) {
    productUrl = productLinks[i];
    request(productUrl, function(err, resp, body) {
        if (err)
            throw err;
        $ = cheerio.load(body);
        var imageUrl = $("#bigImage").attr('src'),
            productNumber = $("#product").attr('class').split(/\s+/)[3].split("_")[1]
        console.log(productNumber);
    });
};
Example of output:
1461536
1499543
TypeError: Cannot call method 'split' of undefined
Since you're not creating a new $ variable for each iteration, it's being overwritten when a request is completed. This can lead to undefined behaviour, where one iteration of the loop is using $ just as it's being overwritten by another iteration.
So try creating a new variable:
var $ = cheerio.load(body);
^^^ this is the important part
Also, you are correct in assuming that the loop continues before the request is completed (in your situation, it isn't cheerio.load that is asynchronous, but request is). That's how asynchronous I/O works.
To coordinate asynchronous operations you can use, for instance, the async module; in this case, async.eachSeries might be useful.
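For illustration, a minimal sketch of what that could look like with async.eachSeries, reusing the productLinks array and the request/cheerio calls from the question (only the imageUrl selector is shown here):

var async = require('async');
var request = require('request');
var cheerio = require('cheerio');

// Process one URL at a time; the next request only starts
// after the previous one signals completion via done().
async.eachSeries(productLinks, function(productUrl, done) {
    request(productUrl, function(err, resp, body) {
        if (err) return done(err);
        var $ = cheerio.load(body);              // new $ for every response
        var imageUrl = $("#bigImage").attr('src');
        console.log(imageUrl);
        done();
    });
}, function(err) {
    if (err) console.error(err);
    else console.log('All product pages processed');
});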
You are scraping some external site(s). You can't be sure the HTML all fits exactly the same structure, so you need to be defensive about how you traverse it.
var product = $('#product');
if (product.length === 0) return console.log('Cannot find a product element'); // a cheerio selection is always truthy, so check its length
var productClass = product.attr('class');
if (!productClass) return console.log('Product element does not have a class defined');
var productNumber = productClass.split(/\s+/)[3].split("_")[1];
console.log(productNumber);
This'll help you debug where things are going wrong, and perhaps indicate that you can't scrape your dataset as easily as you'd hoped.
I am new to nodejs. I am using Bluebird promises to get the responses of an array of HTTP API calls, and storing the derived results in Elasticsearch.
Everything is working fine, except I am unable to access the variables within the 'then' function. Below is my code:
Promise.map(bucket_paths, function(path) {
    this.path = path;
    return getJson.getStoreJson(things, path.path);
}, {concurrency: 1}).then(function(bucketStats) {
    bucketStats.map(function(bucketStat) {
        var bucket_stats_json = {};
        bucket_stats_json.timestamp = new Date();
        bucket_stats_json.name = path.name ==> NOT WORKING
    });
});
How can I access the path.name variable within the 'then' ? Error says 'path' is undefined.
The best way to do this is to package the data you need from one part of the promise chain into the resolved value that is passed on to the next part of the chain. In your case with Promise.map(), you're sending an array of data to the .then() handler, so the cleanest way to pass each path down to the next stage is to make it part of each array entry that Promise.map() resolves. It appears you can just add it to the bucketStat data structure with an extra .then(), as shown below. When you get the data that corresponds to a path, you add the path into that data structure, so later on, when you're walking through all the results, you have the .path property for each object.
You don't show any actual result here so I don't know what you're ultimately trying to end up with, but hopefully you can get the general idea from this.
Also, I switched to Promise.mapSeries() since that's a shortcut when you want concurrency set to 1.
Promise.mapSeries(bucket_paths, function(path) {
    return getJson.getStoreJson(things, path.path).then(bucketStat => {
        // add the path into this item's data so we can get to it later
        bucketStat.path = path;
        return bucketStat;
    });
}).then(function(bucketStats) {
    return bucketStats.map(function(bucketStat) {
        var bucket_stats_json = {};
        bucket_stats_json.timestamp = new Date();
        bucket_stats_json.name = bucketStat.path.name;
        return bucket_stats_json;
    });
});
So I'm building a simple wrapper around an API to fetch all results of a particular entity. The API method can only return up to 500 results at a time, but it's possible to retrieve all results using the skip parameter, which can be used to specify what index to start retrieving results from. The API also has a method which returns the number of results there are that exist in total.
I've spent some time battling with the request package, trying to come up with a way to concatenate all the results in order and then execute a callback that passes all the results through.
This is my code currently:
Donedone.prototype.getAllActiveIssues = function(callback) {
    var url = this.url;
    request(url + `/issues/all_active.json?take=500`, function (error, response, body) {
        if (!error && response.statusCode == 200) {
            var data = JSON.parse(body);
            var totalIssues = data.total_issues;
            var issues = [];
            for (let i = 0; i < totalIssues; i += 500) {
                request(url + `/issues/all_active.json?skip=${i}&take=500`, function (error, response, body) {
                    if (!error && response.statusCode == 200) {
                        console.log(JSON.parse(body).issues.length);
                        issues.concat(JSON.parse(body).issues);
                        console.log(issues); // returns [] on all occasions
                        //callback(issues);
                    } else {
                        console.log("AGHR");
                    }
                });
            }
        } else {
            console.log("ERROR IN GET ALL ACTIVE ISSUES");
        }
    });
};
So I'm starting off with an empty array, issues. I iterate through a for loop, each time increasing i by 500 and passing that as the skip param. As you can see, I'm logging the length of how many issues each response contains before concatenating them with the main issues variable.
The output, from a total of 869 results, is this:
369
[]
500
[]
Why is my issues variable empty when I log it out? There are clearly results to concatenate with it.
A more general question: is this approach the best way to go about what I'm trying to achieve? I figured that even if my code did work, the nature of asynchronicity means it's entirely possible for the results to be concatenated in the wrong order.
Should I just use a synchronous request library?
Why is my issues variable empty when I log it out? There are clearly
results to concatenate with it.
A main problem here is that .concat() returns a new array. It doesn't add items onto the existing array.
You can change this:
issues.concat(JSON.parse(body).issues);
to this:
issues = issues.concat(JSON.parse(body).issues);
to make sure you are retaining the new concatenated array. This is a very common mistake.
You also potentially have sequencing issues in your array because you are running a for loop which is starting a whole bunch of requests at the same time and results may or may not arrive back in the proper order. You will still get the proper total number of issues, but they may not be in the order requested. I don't know if that is a problem for you or not. If that is a problem, we can also suggest a fix for that.
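For illustration, here is one rough sketch of such a fix using plain callbacks: instead of concatenating as responses arrive, each response is written into a slot indexed by its request position, and everything is flattened once the last response comes back. This reuses data, url, and callback from the code in the question.

var totalIssues = data.total_issues;
var chunks = [];            // one slot per request, in request order
var pending = 0;

for (let i = 0; i < totalIssues; i += 500) {
    pending++;
    let index = i / 500;    // remember which slot this request fills
    request(url + `/issues/all_active.json?skip=${i}&take=500`, function (error, response, body) {
        if (!error && response.statusCode == 200) {
            chunks[index] = JSON.parse(body).issues;
        } else {
            chunks[index] = [];   // keep the slot so the order stays aligned
        }
        if (--pending === 0) {
            // every request has come back; flatten in the original order
            callback([].concat.apply([], chunks));
        }
    });
}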
A more general question: is this approach the best way to go about
what I'm trying to achieve? I figured that even if my code did work,
the nature of asynchronicity means it's entirely possible for the
results to be concatenated in the wrong order.
Except for the ordering issue which can also be fixed, this is a reasonable way to do things. We would have to know more about your API to know if this is the most efficient way to use the API to get your results. Usually, you want to avoid making N repeated API calls to the same server and you'd rather make one API call to get all the results.
Should I just use a synchronous request library?
Absolutely not. node.js requires learning how to do asynchronous programming. It is a learning step for most people, but it is how you get the best performance from node.js, so it should be learned and used.
Here's a way to collect all the results in reliable order using promises for synchronization and error propagation (which is hugely useful for async processing in node.js):
// promisify the request() function so it returns a promise
// whose fulfilled value is the request result
function requestP(url) {
    return new Promise(function(resolve, reject) {
        request(url, function(err, response, body) {
            if (err || response.statusCode !== 200) {
                reject({err: err, response: response});
            } else {
                resolve({response: response, body: body});
            }
        });
    });
}
Donedone.prototype.getAllActiveIssues = function() {
    var url = this.url;
    return requestP(url + `/issues/all_active.json?take=500`).then(function(results) {
        var data = JSON.parse(results.body);
        var totalIssues = data.total_issues;
        var promises = [];
        for (let i = 0; i < totalIssues; i += 500) {
            promises.push(requestP(url + `/issues/all_active.json?skip=${i}&take=500`).then(function(results) {
                return JSON.parse(results.body).issues;
            }));
        }
        return Promise.all(promises).then(function(results) {
            // results is an array of each chunk (which is itself an array) so we have an array of arrays
            // now concat all results in order
            return Array.prototype.concat.apply([], results);
        });
    });
};

xxx.getAllActiveIssues().then(function(issues) {
    // process issues here
}, function(err) {
    // process error here
});
I've got a problem with redis and nodejs. I have to loop through a list of phone numbers and check if each number is present in my redis database. Here is my code:
function getContactList(contacts, callback) {
    var contactList = {};
    for (var i = 0; i < contacts.length; i++) {
        var phoneNumber = contacts[i];
        if (utils.isValidNumber(phoneNumber)) {
            db.client().get(phoneNumber).then(function(reply) {
                console.log("before");
                contactList[phoneNumber] = reply;
            });
        }
    }
    console.log("after");
    callback(contactList);
};
The "after" console log appears before the "before" console log, and the callback always return an empty contactList. This is because requests to redis are asynchronous if I understood well. But the thing is I don't know how to make it works.
How can I do ?
You have two main issues.
Your phoneNumber variable will not be what you want it to be. That can be fixed by changing to a .forEach() or .map() iteration of your array because that will create a local function scope for the current variable.
You have to create a way to know when all the async operations are done. There are lots of duplicate questions/answers that show how to do that. You probably want to use Promise.all().
I'd suggest this solution that leverages the promises you already have:
function getContactList(contacts) {
    var contactList = {};
    return Promise.all(contacts.filter(utils.isValidNumber).map(function(phoneNumber) {
        return db.client().get(phoneNumber).then(function(reply) {
            // build custom object
            contactList[phoneNumber] = reply;
        });
    })).then(function() {
        // make contactList be the resolve value
        return contactList;
    });
}
getContactList(contacts).then(function(contactList) {
    // use the contactList here
}, function(err) {
    // process errors here
});
Here's how this works:
Call contacts.filter(utils.isValidNumber) to filter the array to only valid numbers.
Call .map() to iterate through that filtered array
return db.client().get(phoneNumber) from the .map() callback to create an array of promises.
After getting the data for the phone number, add that data to your custom contactList object (this is essentially a side effect of the .map() loop).
Use Promise.all() on the returned array of promises to know when they are all done.
Make the contactList object we built up be the resolve value of the returned promise.
Then, to call it just use the returned promise with .then() to get the final result. No need to add a callback argument when you already have a promise that you can just return.
The simplest solution may be to use MGET with a list of phone numbers and put the callback in the 'then' section.
You could also put the promises in an array and use Promise.all().
At some point you might want your function to return a promise rather than with callback, just to stay consistent.
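For illustration, here is a rough sketch of the MGET idea. It assumes db.client() exposes a promise-returning mget() alongside the get() used in the question; whether that method exists, and what it is called (for example mgetAsync with Bluebird's promisification), depends on your redis client.

// Sketch only: assumes db.client().mget(...) returns a promise, mirroring
// the promise-returning get() used in the question.
function getContactList(contacts) {
    var validNumbers = contacts.filter(utils.isValidNumber);
    return db.client().mget(validNumbers).then(function(replies) {
        // MGET returns values in the same order as the keys,
        // with null for keys that do not exist
        var contactList = {};
        validNumbers.forEach(function(phoneNumber, i) {
            contactList[phoneNumber] = replies[i];
        });
        return contactList;
    });
}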
Consider refactoring your NodeJS code to use Promises.
Bluebird is an excellent choice: http://bluebirdjs.com/docs/working-with-callbacks.html
You put async code into a for loop (a synchronous construct), so each iteration of the for loop does not wait for the db.client(...) call to finish.
Take a look at this Stack Overflow answer; it explains how to make asynchronous loops.
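As a rough sketch of such an asynchronous loop (not taken from the linked answer), you can chain the promises so each lookup starts only after the previous one has finished, reusing the promise-returning db.client().get() from the question:

// Sequential "async loop": each get() starts only after the previous one resolves
function getContactList(contacts) {
    var contactList = {};
    return contacts
        .filter(utils.isValidNumber)
        .reduce(function(chain, phoneNumber) {
            return chain.then(function() {
                return db.client().get(phoneNumber).then(function(reply) {
                    contactList[phoneNumber] = reply;
                });
            });
        }, Promise.resolve())
        .then(function() {
            return contactList;
        });
}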
I'm using https://github.com/Haidy777/node-youtubeAPI-simplifier to grab some information from a playlist of Bounty Killers. The way this library is set up, it seems to use Promises via Bluebird (https://github.com/petkaantonov/bluebird), which I don't know much about. Looking up the Beginner's Guide for Bluebird gives http://bluebirdjs.com/docs/beginners-guide.html, which literally just shows
This article is partially or completely unfinished. You are welcome to create pull requests to help completing this article.
I am able to set up the library
var ytapi = require('node-youtubeapi-simplifier');
ytapi.setup('My Server Key');
As well as list some information about Bounty Killers
ytdata = [];
ytapi.playlistFunctions.getVideosForPlaylist('PLCCB0BFBF2BB4AB1D')
    .then(function (data) {
        for (var i = 0, len = data.length; i < len; i++) {
            ytapi.videoFunctions.getDetailsForVideoIds([data[i].videoId])
                .then(function (video) {
                    console.log(video);
                    // ytdata.push(video); <- Push a Bounty Killer Video
                });
        }
    });
// console.log(ytdata); This gives []
Basically, the above pulls the full playlist (normally there will be some pagination here depending on the length), then takes the data from getVideosForPlaylist, iterates the list, and calls getDetailsForVideoIds for each YouTube video. All good here.
The issue arises with getting data out of this. I would like to push each video object to the ytdata array, and I'm unsure whether the empty array at the end is due to scoping or to things being out of sync such that console.log(ytdata) gets called before the API calls are finished.
How will I be able to get each Bounty Killer video into the ytdata array to be available globally?
console.log(ytdata) gets called before the API calls are finished
Spot on, that's exactly what's happening here, the API calls are async. Once you're using async functions, you must go the async way if you want to deal with the returned data. Your code could be written like this:
var ytapi = require('node-youtubeapi-simplifier');
ytapi.setup('My Server Key');

// this function returns a promise you can "wait" on
function getVideos() {
    return ytapi.playlistFunctions
        .getVideosForPlaylist('PLCCB0BFBF2BB4AB1D')
        .then(function (videos) {
            // extract all videoIds
            var videoIds = videos.map(video => video.videoId);
            // getDetailsForVideoIds is called with an array of videoIds
            // and returns a promise, so one API call is enough
            return ytapi.videoFunctions.getDetailsForVideoIds(videoIds);
        });
}

getVideos().then(function (ydata) {
    // this is the only place ydata is full of data
    console.log(ydata);
});
I made use of ES6's arrow function in videos.map(video => video.videoId);, that should work if your nodejs is v4+.
console.log(ytdata) should be immediately AFTER your FOR loop. This data is NOT available until the promises are resolved and the FOR loop execution is complete; attempting to access it beforehand will give you an empty array.
(Your current console.log is not working because that code is being executed immediately, before the promises are resolved.) Only code inside the THEN block is executed AFTER the promise is resolved.
If you NEED the data available NOW or ASAP, and the requests for the videos are taking a long time, then can you request 1 video at a time, or on demand, or on a separate thread (using a web worker maybe)? Can you implement caching?
Can you make the requests up front behind the scenes before the user even visits this page? (not sure this is a good idea but it is an idea)
Can you use video thumbnails (like youtube does) so that when the thumbnail is clicked then you start streaming and playing the video?
Some ideas ... Hope this helps
ytdata = [];
ytapi.playlistFunctions.getVideosForPlaylist('PLCCB0BFBF2BB4AB1D')
    .then(function (data) {
        // THE CODE INSIDE THIS THEN BLOCK IS EXECUTED WHEN ALL THE VIDEO IDS HAVE BEEN RETRIEVED AND ARE AVAILABLE
        // YOU COULD SAVE THESE TO A DATASTORE IF YOU WANT
        for (var i = 0, len = data.length; i < len; i++) {
            var videoIds = [data[i].videoId];
            ytapi.videoFunctions.getDetailsForVideoIds(videoIds)
                .then(function (video) {
                    // THE CODE INSIDE THIS THEN BLOCK IS EXECUTED WHEN ALL THE DETAILS HAVE BEEN DOWNLOADED FOR ALL videoIds provided
                    // AGAIN YOU CAN DO WHATEVER YOU WANT WITH THESE DETAILS
                    // ALSO NOW THAT THE DATA IS AVAILABLE YOU MIGHT WANT TO HIDE THE LOADING ICON AND RENDER THE PAGE! AGAIN JUST AN IDEA, A DATA STORE WOULD PROVIDE FASTER ACCESS BUT YOU WOULD NEED TO UPDATE THE CACHE EVERY SO OFTEN
                    // ytdata.push(video); <- Push a Bounty Killer Video
                });
            // THE DETAILS FOR ANOTHER VIDEO BECOMES AVAILABLE AFTER EACH ITERATION OF THE FOR LOOP
        }
        // ALL THE DATA IS AVAILABLE WHEN THE FOR LOOP HAS COMPLETED
    });
// This is executed immediately before YTAPI has responded.
// console.log(ytdata); This gives []
I need to use bluebird in my code and I have no idea how to use it. My code contains nested loops. When the user logs in, my code will run. It will begin to look for any files under the user, and if there are files, it will loop through them to get the names of the files, since the names are stored in a dictionary. Once it gets a name, it will store the name in an array. Once all the names are stored, the array will be passed along in res.render().
Here is my code:
router.post('/login', function(req, res) {
    var username = req.body.username;
    var password = req.body.password;
    Parse.User.logIn(username, password, {
        success: function(user) {
            var Files = Parse.Object.extend("File");
            var object = [];
            var query = new Parse.Query(Files);
            query.equalTo("user", Parse.User.current());
            var temp;
            query.find({
                success: function(results) {
                    for (var i = 0; i < results.length; i++) {
                        var file = results[i].toJSON();
                        for (var k in file) {
                            if (k === "javaFile") {
                                for (var t in file[k]) {
                                    if (t === "name") {
                                        temp = file[k][t];
                                        var getname = temp.split("-").pop();
                                        object[i] = getname;
                                    }
                                }
                            }
                        }
                    }
                }
            });
            console.log(object);
            res.render('filename', {title: 'File Name', FIles: object});
            console.log(object);
        },
        error: function(user, error) {
            console.log("Invalid username/password");
            res.render('logins');
        }
    })
});
EDIT: The code doesn't work because, on the first and second console.log(object), I get an empty array. I am supposed to get one item in that array, because I have one file saved.
JavaScript code is all parsed from top to bottom, but it doesn't necessarily execute in that order with asynchronous code. The problem is that you have the log statements inside of the success callback of your login function, but NOT inside of the query's success callback.
You have a few options:
Move the console.log statements inside of the inner success callback so that while they may be parsed at load time, they do not execute until both callbacks have been invoked.
Promisify functions that traditionally rely on and invoke callback functions, and hang then handlers off of the returned value to chain the promises together.
The first option is not using promises at all, but relying solely on callbacks. To flatten your code you will want to promisify the functions and then chain them.
I'm not familiar with the syntax you're using there with the success and error callbacks, nor am I familiar with Parse. Typically you would do something like:
query.find(someArgsHere, function(success, err) {
});
But then you would have to nest another callback inside of that, and another callback inside of that. To "flatten" the pyramid, we make the function return a promise instead, and then we can chain the promises. Assuming that Parse.User.logIn is a callback-style function (as is Parse.Query.find), you might do something like:
var Promise = require('bluebird');
var login = Promise.promisify(Parse.User.logIn);
var find = Promise.promisify(Parse.Query.find);

var outerOutput = [];
return login(yourArgsHere)
    .then(function(user) {
        return find(user.someValue);
    })
    .then(function(results) {
        var innerOutput = [];
        // do something with innerOutput or outerOutput and render it
    });
This should look familiar to synchronous code that you might be used to, except instead of saving the returned value into a variable and then passing that variable to your next function call, you use "then" handlers to chain the promises together. You could either create the entire output variable inside of the second then handler, or you can declare the variable output prior to even starting this promise chain, and then it will be in scope for all of those functions. I have shown you both options above, but obviously you don't need to define both of those variables and assign them values. Just pick the option that suits your needs.
You can also use Bluebird's promisifyAll() function to wrap an entire library with equivalent promise-returning functions. They will all have the same names as the functions in the library, suffixed with Async. So assuming the Parse library contains callback-style functions named someFunctionName() and someOtherFunc(), you could do this:
var Parse = Promise.promisifyAll(require("Parse"));

var promiseyFunction = function() {
    return Parse.someFunctionNameAsync()
        .then(function(result) {
            return Parse.someOtherFuncAsync(result.someProperty);
        })
        .then(function(otherFuncResult) {
            var something;
            // do stuff to assign a value to something
            return something;
        });
};
I have a few pointers. By the way, are you trying to use Parse's promises?
You can get rid of those inner nested loops and make a few other changes.
Use some syntax like this to be more elegant:
// Example to filter/retrieve only valid file objects (with dashes in the name)
var matchedFiles = results.filter(function _hasJavaFile(item) {
    return item && item.javaFile && item.javaFile.name // NOT NULL
        && item.javaFile.name.indexOf('-') > -1;       // and has a dash
});

// You could use a map function like this to get the files into an array of just their names
var fileNames = matchedFiles.map(function _getJavaFile(item) {
    return item && item.javaFile && item.javaFile.name // NOT NULL
        && item.javaFile.name.split('-')[0];           // RETURN first part of name
});
And here is an example of using Parse's native promises (add the code above inside the then() callback below; note the then() function, which is effectively now your 'callback' handler):
var GameScore = Parse.Object.extend("GameScore");
var query = new Parse.Query(GameScore);
query.select("score", "playerName");
query.find().then(function(results) {
    // each of results will only have the selected fields available.
});
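Putting those pieces together for this question, a rough sketch might look like the following; it swaps GameScore for the File query from the question, keeps the split("-").pop() behaviour the original code used, and assumes it runs inside the login success callback so that res is in scope:

var Files = Parse.Object.extend("File");
var query = new Parse.Query(Files);
query.equalTo("user", Parse.User.current());

query.find().then(function(results) {
    // plain JSON objects, as in the original loop
    var files = results.map(function(result) { return result.toJSON(); });

    // keep only entries that actually have a javaFile with a dash in its name
    var matchedFiles = files.filter(function(item) {
        return item && item.javaFile && item.javaFile.name
            && item.javaFile.name.indexOf('-') > -1;
    });

    // last part of each name, mirroring split("-").pop() from the question
    var fileNames = matchedFiles.map(function(item) {
        return item.javaFile.name.split('-').pop();
    });

    // only here is the data actually available
    res.render('filename', {title: 'File Name', FIles: fileNames});
}, function(error) {
    console.log("Query failed", error);
    res.render('logins');
});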