I have a Node.js application that uses the request library to GET a number of URLs. The code is something like this:
for (var i = 0; i < docs.length; i++) {
    var URL = docs[i].url;
    request(URL, function (error, response, html) {
        console.log(URL);
        // Other code...
    });
}
For the sake of simplicity, let's say docs contains URLs like [URL1, URL2, URL3, ...]. On the first iteration, URL = URL1 and a request is sent to that URL. On the second iteration, a request is sent to URL2, and so on. However, by the end of the loop, URL = URLn. Inside the request callback, when I log URL, I always get URLn, but I need to be able to get the respective URLs, that is, [URL1, URL2, URL3, ...].
Any idea how I can maintain a local copy of the URL that remains unchanged even when the global URL gets changed?
This must be something easy, but I can't figure it out.
Basically, what you are experiencing is normal JavaScript behavior, nothing specific to Node.js.
The only construct in JavaScript that defines a new scope is the function, and functions can access their own scope as well as any "outer" scopes.
Hence, the solution is to wrap your code in a function that takes the changing value as a parameter: the parameter is evaluated anew for each call, so the code inside the function gets its own "copy".
Basically, you have two options. Either use an immediately invoked function expression (IIFE), which is nothing but a nameless (i.e. anonymous) function that is called immediately where it is defined:
for (var i = 0; i < docs.length; i++) {
    (function (url) {
        request(url, function (error, response, html) {
            console.log(url);
        });
    })(docs[i].url);
}
Or use the array's built-in forEach method, which already wraps the loop body in a function for you (with the same effect):
docs.forEach(function (doc) {
    request(doc.url, function (error, response, html) {
        console.log(doc.url);
    });
});
You should read about closures in JavaScript here.
In simpler terms: by the time any callback runs, the loop has already finished and the variable holds its final value, which is why you rightly get URLn every time. If you wrap the request inside an immediately invoked function expression, you create another level in the scope chain. By doing so, the callback passed to request no longer refers to the shared variable in the outer scope, but to the parameter that was captured in the function's scope at the time the request was sent, and that is the value you expect.
Code would be something like this then:
for (var i = 0; i < docs.length; i++) {
    var URL = docs[i].url;
    (function (currentURL) {
        // now the URL is preserved as currentURL inside the scope of the function
        request(currentURL, function (error, response, html) {
            console.log(currentURL);
            // This value of currentURL is the one that was available in the scope chain
            // Other code...
        });
    })(URL);
}
Just wrap the code in a function or use forEach; the problem happens because every callback closes over the same variable.
docs.forEach(function (doc) {
    var URL = doc.url;
    request(URL, function (error, response, html) {
        console.log(URL);
        // Other code...
    });
});
Another fix
for (var i = 0; i < docs.length; i++) {
    makeRequest(docs[i]);
}

function makeRequest(doc) {
    var URL = doc.url;
    request(URL, function (error, response, html) {
        console.log(URL);
    });
}
And another, slightly uglier, fix with a closure inside the for loop:
for (var i = 0; i < docs.length; i++) {
    (function (doc) {
        var URL = doc.url;
        request(URL, function (error, response, html) {
            console.log(URL);
            // Other code...
        });
    })(docs[i]);
}
If you use something like JSHint, it will warn you not to create functions inside for loops, precisely because they cause problems like this.
Just use let instead of var, i.e.:
for (let i = 0; i < docs.length; i++) {
    let URL = docs[i].url;
    request(URL, function (error, response, html) {
        console.log(URL);
        // other code...
    });
}
Related
I have a loop like this:
var req;
for (var i = 0; i < sites.length; i++) {
    req = https.get(sites[i], handleRequest);
    req.on('error', handleError);
}
The callback (handleRequest) runs asynchronously, for each website being requested.
However, the only parameter in handleRequest seems to be a "response".
When the callback is run, the loop has already completed, so how can I keep track of which website is this response for, so I can handle it accordingly?
You can change handleRequest to take two parameters, with url as the first. You can then partially apply the function via Function#bind, fixing the url parameter at the time you start the request, while the second argument (the response) is still supplied later.
let sites = [
    "https://example.com",
    "https://example.net",
    "https://example.org"
];

function handleRequest(url, res) {
    console.log("handling:", url);
    /* handling code */
}

// minimalistic dummy HTTP module that responds after 1 second
let https = {
    get: handler => setTimeout(handler, 1000)
};

for (var i = 0; i < sites.length; i++) {
    let url = sites[i];
    https.get(handleRequest.bind(this, url)); // partially apply handleRequest
}
You can get a similar result via currying - instead of having two parameters, first take one, then return a function that takes the other. It leads to (in my opinion) better syntax when calling:
let sites = [
    "https://example.com",
    "https://example.net",
    "https://example.org"
];

function handleRequest(url) {
    return function actualHandler(res) {
        console.log("handling:", url);
        /* handling code */
    };
}

// minimalistic dummy HTTP module that responds after 1 second
let https = {
    get: handler => setTimeout(handler, 1000)
};

for (var i = 0; i < sites.length; i++) {
    let url = sites[i];
    https.get(handleRequest(url));
}
I'm just starting off with Node.js and struggling with some of the finer points of non-blocking (asynchronous?) code. I know there are lots of questions about blocking vs non-blocking code already, but after reading through some of them, I still couldn't sort out this issue.
As a learning exercise, I made a simple script that loads URLs from a file, queries them using the request module, and notifies me if a URL is the New York Times homepage.
Here is a MWE:
// CSV Parse test
'use strict';
var request = require('request');
var fs = require('fs');
var parse = require('csv-parse');

var text = fs.readFileSync('input.txt', 'utf8');
var r_isnyt = /New York Times/;
var data = [];

parse(text, {skip_empty_lines: true}, function (err, data) {
    for (var r = 0; r < data.length; r++) {
        console.log('Logging from within parse function:');
        console.log('URL: ' + data[r][0] + '\n');
        var url = data[r][0];
        request(url, function (error, response, body) {
            console.log('Logging from within request function:');
            console.log('Loading URL: ' + url + '\n');
            if (!error && response.statusCode == 200) {
                if (r_isnyt.exec(body)) {
                    console.log('This is the NYT site! ');
                }
                console.log('');
            }
        });
    }
});
And here is my input.txt:
http://www.nytimes.com/
www.google.com
From what I understood of non-blocking code, this program's flow would be:
parse(text, {skip_empty_lines: true}, function(err, data){ loads the data and returns the lines of the input file in a 2D array, which is complete and available right after that line.
The for loop iterates through it, loading URLs with the line request(url, function(error, response, body) {, which is non-blocking (right?), so the for loop continues without waiting for the previous URL to finish loading.
As a result, you could have multiple URLs being loaded at once, and the console.log calls within request will print in the order the responses are received, not the order of the input file.
Within request, which has access to the results of the request to url, we print the URL, check whether it's the New York Times, and print the result of that check (all steps I assumed were synchronous).
That's a long-winded way of getting around to my question. I just wanted to clarify that I thought I understood the basic concepts of non-blocking code. So what's baffling me is that my output is as follows:
>node parsecsv.js
Logging from within parse function:
URL: http://www.nytimes.com/
Logging from within parse function:
URL: www.google.com
Logging from within request function:
Loading URL: www.google.com
Logging from within request function:
Loading URL: www.google.com
This is the NYT site!
>
I understand why the request printouts all happen together at the end, but why do they both print Google, and much more baffling, why does the last one say it's the NYT site, when the log line right before it (from within the same request call) has just printed Google? It's like the request calls are getting the correct URLs, but the console.log calls are lagging, and just print everything at the end with the ending values.
Interestingly, if I reverse the order of the URLs, everything looks correct in the output, I guess because of differences in response times from the sites:
node parsecsv.js
Logging from within parse function:
URL: www.google.com
Logging from within request function:
Loading URL: www.google.com
Logging from within parse function:
URL: http://www.nytimes.com/
Logging from within request function:
Loading URL: http://www.nytimes.com/
This is the NYT site!
>
Thanks in advance.
Update
Based on the answer from jfriend00 below, I've changed my code to use a .forEach loop instead as follows. This appears to fix the issue.
// CSV Parse test
'use strict';
var request = require('request');
var fs = require('fs');
var parse = require('csv-parse');

var text = fs.readFileSync('input.txt', 'utf8');
var r_isnyt = /New York Times/;
var data = [];

parse(text, {skip_empty_lines: true}, function (err, data) {
    data.forEach(function (row) {
        console.log('Logging from within parse function:');
        console.log('URL: ' + row[0] + '\n');
        let url = row[0];
        request(url, function (error, response, body) {
            console.log('Logging from within request function:');
            console.log('Loading URL: ' + url + '\n');
            if (!error && response.statusCode == 200) {
                if (r_isnyt.exec(body)) {
                    console.log('This is the NYT site! ');
                }
                console.log('');
            }
        });
    });
});
I understand why the request printouts all happen together at the end,
but why do they both print Google, and much more baffling, why does
the last one say it's the NYT site, when the log line right before it
(from within the same request call) has just printed Google? It's like
the request calls are getting the correct URLs, but the console.log
calls are lagging, and just print everything at the end with the
ending values.
You correctly understand that the for loop initiates all the request() calls and then they finish sometime later in whatever order the responses come back in.
But, your logging statement:
console.log('Loading URL: '+url+'\n');
refers to a variable that is shared by all iterations of your for loop. Since the for loop runs to completion and the responses arrive and get processed sometime later, the loop will have finished by the time any response is handled, so the variable url will hold whatever value it had when the loop finished: the value from the last iteration.
In ES6, you can declare the variable with let instead of var and it will be block scoped, so there will be a unique url variable for each iteration of the loop.
So, change:
var url = data[r][0];
to
let url = data[r][0];
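A minimal standalone demonstration of the difference, with timers standing in for the HTTP requests:

// With var, every callback sees the same loop variable, which has its final value by the time they run.
for (var i = 0; i < 3; i++) {
    setTimeout(function () { console.log('var:', i); }, 0); // logs "var: 3" three times
}

// With let, each iteration gets its own binding, so each callback sees its own iteration's value.
for (let j = 0; j < 3; j++) {
    setTimeout(function () { console.log('let:', j); }, 0); // logs "let: 0", "let: 1", "let: 2"
}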
Prior to ES6, a common way to avoid this issue was to iterate with .forEach(): because it takes a callback function, all of your loop code lives in its own scope, so each iteration gets its own local variables rather than sharing them.
FYI, though let solves this issue and is one of the things it was designed for, I think your code would probably be a bit cleaner if you just used .forEach() for your iteration since it would replace multiple references to data[r] with a single reference to the current array iteration value.
parse(text, {skip_empty_lines: true}, function (err, data) {
    data.forEach(function (row) {
        console.log('Logging from within parse function:');
        console.log('URL: ' + row[0] + '\n');
        let url = row[0];
        request(url, function (error, response, body) {
            console.log('Logging from within request function:');
            console.log('Loading URL: ' + url + '\n');
            if (!error && response.statusCode == 200) {
                if (r_isnyt.exec(body)) {
                    console.log('This is the NYT site! ');
                }
                console.log('');
            }
        });
    });
});
Your code is fine and you're correct about how it works (including that differences in response times are what's making everything seem good when you switch the order around), but your logging has fallen victim to an unexpected closure: url is declared and updated in the scope of the parse() callback, and in the case where www.google.com is logged both times, it is being updated to its final value by the loop before your request() callbacks start executing.
I am a total scrub with the node http module and having some trouble.
The ultimate goal here is to take a huge list of urls, figure out which are valid and then scrape those pages for certain data. So step one is figuring out if a URL is valid and this simple exercise is baffling me.
say we have an array allURLs:
["www.yahoo.com", "www.stackoverflow.com", "www.sdfhksdjfksjdhg.net"]
The goal is to iterate this array, make a get request to each and if a response comes in, add the link to a list of workingURLs (for now just another array), else it goes to a list brokenURLs.
var workingURLs = [];
var brokenURLs = [];

for (var i = 0; i < allURLs.length; i++) {
    var url = allURLs[i];
    var req = http.get(url, function (res) {
        if (res) {
            workingURLs.push(?????); // How to derive URL from response?
        }
    });
    req.on('error', function (e) {
        brokenURLs.push(e.host);
    });
}
What I don't know is how to properly obtain the URL from the request/response object itself, or really how to structure this kind of async code, because again, I am a Node.js scrub :(
For most websites using res.headers.location works, but there are times when the headers do not have this property, and that will cause problems for me later on. I've also tried console logging the response object itself, and that was a messy and fruitless endeavor.
I have tried pushing the url variable to workingURLs, but by the time any response comes back that would trigger the push, the for loop is already over and url is forever pointing to the final element of the allURLs array.
Thanks to anyone who can help
You need to close over the url value to keep access to it and protect it from being changed on the next loop iteration.
For example:
(function (url) {
    // use url here
})(allURLs[i]);
The simplest solution is to use forEach instead of for.
allURLs.forEach(function (url) {
    //....
});
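Put together, a sketch of the forEach approach applied to the question's code might look like this (assuming http and allURLs are set up as in the question, with full http:// URLs as in the promisified example below; untested):

var workingURLs = [];
var brokenURLs = [];

allURLs.forEach(function (url) {
    // url is a parameter of this callback, so each request closes over its own copy
    http.get(url, function (res) {
        if (res) {
            workingURLs.push(url); // no need to derive the URL from the response
        }
    }).on('error', function (e) {
        brokenURLs.push(url);
    });
});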
A promisified solution also lets you know the moment when all of the work is done:
var http = require('http');

var allURLs = [
    "http://www.yahoo.com/",
    "http://www.stackoverflow.com/",
    "http://www.sdfhksdjfksjdhg.net/"
];

var workingURLs = [];
var brokenURLs = [];

var promises = allURLs.map(url => validateUrl(url)
    .then(res => (res ? workingURLs : brokenURLs).push(url)));

Promise.all(promises).then(() => {
    console.log(workingURLs, brokenURLs);
});

// ----

function validateUrl(url) {
    return new Promise((ok, fail) => {
        http.get(url, res => ok(res.statusCode == 200))
            .on('error', e => ok(false));
    });
}

// Prevent Node.js from exiting early; not needed if a server is listening.
var t = setTimeout(() => { console.log('Time is over'); }, 1000).ref();
You can use something like this (Not tested):
const arr = ["", "/a", "", ""];

Promise.all(arr.map(url => fetch(url)))
    .then(responses => responses.filter(res => res.ok).map(res => res.url))
    .then(workingUrls => {
        console.log(workingUrls);
        console.log(arr.filter(url => workingUrls.indexOf(url) == -1));
    });
EDITED
Working fiddle (note that you can't make a request to another site from the browser because of cross-domain restrictions).
UPDATED with #vp_arth suggestions
const arr = ["/", "/a", "/", "/"];

let working = [], notWorking = [],
    find = url => fetch(url)
        .then(res => res.ok ?
            working.push(res.url) && res : notWorking.push(res.url) && res);

Promise.all(arr.map(find))
    .then(responses => {
        console.log('working', working, 'notWorking', notWorking);
        /* Do whatever with the responses if needed */
    });
Fiddle
My input is streamed from another source, which makes it difficult to use async.forEach. I am pulling data from an API endpoint, but I have a limit of 1000 objects per request to the endpoint, and I need to get hundreds of thousands of them (basically all of them) and I will know they're finished when the response contains < 1000 objects. Now, I have tried this approach:
/* List all deposits */
var depositsAll = [];
var depositsIteration = [];

async.doWhilst(this._post(endpoint_path, function (err, response) {
    // check err
    /* Loop through the data and gather only the deposits */
    for (var key in response) {
        // do some stuff
    }
    depositsAll += depositsIteration;
    return callback(null, depositsAll);
}, {limit: 1000, offset: 0, sort: 'desc'}),
response.length > 1000, function (err, depositsAll) {
    // check for err
    // return the complete result
    return callback(null, depositsAll);
});
With this code I get an async internal error that iterator is not a function. But in general I am almost sure the logic is not correct as well.
If it's not clear what I'm trying to achieve - I need to perform a request multiple times, and add the response data to a result that at the end contains all the results, so I can return it. And I need to perform requests until the response contains less than 1000 objects.
I also looked into async.queue but could not get the hang of it...
Any ideas?
You should be able to do it like that, but if that example is from your real code you have misunderstood some of how async works. doWhilst takes three arguments, each of them being a function:
The function to be called by async. It receives a callback argument that must be invoked when the iteration finishes. In your case, you need to wrap this._post inside another function.
The test function (you were passing the value of response.length > 1000, i.e. a boolean, and that is assuming response were even defined at that point)
The final function to be called once execution is stopped
Example with each needed function separated for readability:
var depositsAll = [];
var responseLength = 1000;
var self = this;

var post = function (asyncCb) {
    self._post(endpoint_path, function (err, res) {
        ...
        responseLength = res.length;
        asyncCb(err, depositsAll);
    });
};

var check = function () {
    return responseLength >= 1000;
};

var done = function (err, deposits) {
    console.log(deposits);
};

async.doWhilst(post, check, done);
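For completeness, here is a rough sketch of how the accumulation and offset-based paging from your question might fill in the elided part (the _post signature and the limit/offset option names are taken from your snippet and are assumptions, not a documented API):

var depositsAll = [];
var responseLength = 1000;
var offset = 0;
var self = this;

var post = function (asyncCb) {
    self._post(endpoint_path, function (err, res) {
        if (err) { return asyncCb(err); }
        depositsAll = depositsAll.concat(res); // accumulate this page of results
        responseLength = res.length;           // remember how many objects came back
        offset += 1000;                        // advance to the next page
        asyncCb(null, depositsAll);
    }, {limit: 1000, offset: offset, sort: 'desc'});
};

var check = function () {
    return responseLength >= 1000;             // keep requesting while pages are full
};

var done = function (err, deposits) {
    return callback(err, deposits);            // hand back the complete result, as in your code
};

async.doWhilst(post, check, done);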
I have a chrome extension browser action that I want to have list a series of links, and open any selected link in the current tab. So far what I have is this, using jquery:
var url = urlForThisLink;
var li = $('<li/>');
var ahref = $('<a href="#">' + title + '</a>');
ahref.click(function () {
    chrome.tabs.getSelected(null, function (tab) {
        chrome.tabs.update(tab.id, {url: url});
    });
});
li.append(ahref);
It partially works. It does navigate the current tab, but will only navigate to whichever link was last created in this manner. How can I do this for an iterated series of links?
#jmort253's answer is actually a good illustration of what is probably your error. Despite being declared inside the for loop, url has function scope since it is declared with var. So your click handler closure is binding to a variable scoped outside the for loop, and every instance of the closure uses the same value, ie. the last one.
Once Chrome supports the let keyword you will be able to use it instead of var and it will work fine since url will be scoped to the body of the for loop. In the meantime you'll have to create a new scope by creating your closure in a function:
function makeClickHandler(url) {
    return function() { ... };
}
Inside the for loop say:
for (var i = 0; i < urls.length; i++) {
    var url = urls[i];
    ...
    ahref.click(makeClickHandler(url));
    ...
}
In your code example, it looks like you only have a single link. Instead, let's assume you have an actual collection of links. In that case, you can use a for loop to iterate through them:
// collection of urls
var urls = ["http://example.com", "http://domain.org"];

// loop through the collection, building a separate link for each url.
for (var i = 0; i < urls.length; i++) {
    // this is the link for iteration i
    var url = urls[i];
    var li = $('<li/>');
    var ahref = $('<a href="#">' + title + '</a>');
    ahref.click((function (pUrl) {
        return function () {
            chrome.tabs.getSelected(null, function (tab) {
                chrome.tabs.update(tab.id, {url: pUrl});
            });
        };
    })(url));
    li.append(ahref);
}
I totally forgot about scope when writing the original answer, so I updated it to use a closure based on Matthew Gertner's answer. Basically, in the click event handler, I'm now passing the url variable into an anonymous one-argument function which returns another function. The returned function uses the argument passed into the anonymous function, so its state is unaffected by later iterations of the for loop changing the value of url.