I have an external API that rate-limits API requests to up to 25 requests per second. I want to insert parts of the results into a MongoDB database.
How can I rate-limit the request function so that I don't miss any of the API results for the array?
MongoClient.connect('mongodb://127.0.0.1:27017/test', function (err, db) {
    if (err) {
        throw err;
    } else {
        for (var i = 0; i < arr.length; i++) {
            // need to rate limit the following function, without missing any value in the arr array
            request({
                method: 'GET',
                url: 'https://SOME_API/json?address=' + arr[i]
            }, function (error, response, body) {
                // doing computation, including inserting to mongo
            });
        }
    }
});
This could be done using the request-rate-limiter package. Add this to your code:
var RateLimiter = require('request-rate-limiter');
const REQS_PER_MIN = 25 * 60; // that's 25 per second
var limiter = new RateLimiter(REQS_PER_MIN);
Since request-rate-limiter is built on top of request, you can then simply replace each call to request with limiter.request.
You can find further information on the package's npm page: https://www.npmjs.com/package/request-rate-limiter
On a personal note, I'd replace all these callbacks with promises.
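If you'd rather not add a dependency, the same idea can be hand-rolled with a timer. This is a minimal sketch (not the request-rate-limiter API): it hands one item to the worker, then waits long enough before the next so that at most `perSecond` requests start per second.

```javascript
// Minimal hand-rolled limiter: start at most `perSecond` workers per second.
// `worker` is called with one item at a time; `done` fires after the last item.
function rateLimitedEach(items, perSecond, worker, done) {
    var interval = Math.ceil(1000 / perSecond);
    var i = 0;
    (function next() {
        if (i >= items.length) return done();
        worker(items[i++]);
        setTimeout(next, interval);
    })();
}
```

With the code above, you would call `rateLimitedEach(arr, 25, function (address) { /* request(...) */ }, function () { /* all requests started */ })` so no element of `arr` is skipped.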
You need to combine 2 things.
A throttling mechanism. I suggest _.throttle from the lodash project. This can do the rate limiting for you.
You also need an async control-flow mechanism to make sure the requests run in series (don't start the second one until the first is done). For that I suggest async.eachSeries.
Both of these changes will be cleaner if you refactor your code to this signature:
function scrape(address, callback) {
    // code to fetch a single address, do computation, and save to mongo here
    // invoke the callback with (error, result) when done
}
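If pulling in the async package isn't an option, eachSeries is also easy to hand-roll. This sketch runs the worker on one item at a time and stops on the first error, matching the scrape(address, callback) signature above:

```javascript
// Hand-rolled stand-in for async.eachSeries: process items one at a time,
// waiting for each callback before starting the next; stop on first error.
function eachSeries(items, worker, done) {
    var i = 0;
    (function next(err) {
        if (err || i >= items.length) return done(err || null);
        worker(items[i++], next);
    })();
}

// usage: eachSeries(arr, scrape, function (err) { /* all done */ });
```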
Related
So I'm building a simple wrapper around an API to fetch all results of a particular entity. The API method can only return up to 500 results at a time, but it's possible to retrieve all results using the skip parameter, which can be used to specify what index to start retrieving results from. The API also has a method which returns the number of results there are that exist in total.
I've spent some time battling with the request package, trying to come up with a way to concatenate all the results in order and then execute a callback that passes all the results through.
This is my code currently:
Donedone.prototype.getAllActiveIssues = function (callback) {
    var url = this.url;
    request(url + `/issues/all_active.json?take=500`, function (error, response, body) {
        if (!error && response.statusCode == 200) {
            var data = JSON.parse(body);
            var totalIssues = data.total_issues;
            var issues = [];
            for (let i = 0; i < totalIssues; i += 500) {
                request(url + `/issues/all_active.json?skip=${i}&take=500`, function (error, response, body) {
                    if (!error && response.statusCode == 200) {
                        console.log(JSON.parse(body).issues.length);
                        issues.concat(JSON.parse(body).issues);
                        console.log(issues); // returns [] on all occasions
                        //callback(issues);
                    } else {
                        console.log("AGHR");
                    }
                });
            }
        } else {
            console.log("ERROR IN GET ALL ACTIVE ISSUES");
        }
    });
};
So I'm starting off with an empty array, issues. I iterate through a for loop, each time increasing i by 500 and passing that as the skip param. As you can see, I'm logging the length of how many issues each response contains before concatenating them with the main issues variable.
The output, from a total of 869 results is this:
369
[]
500
[]
Why is my issues variable empty when I log it out? There are clearly results to concatenate with it.
A more general question: is this approach the best way to go about what I'm trying to achieve? I figured that even if my code did work, the nature of asynchronicity means it's entirely possible for the results to be concatenated in the wrong order.
Should I just use a synchronous request library?
Why is my issues variable empty when I log it out? There are clearly
results to concatenate with it.
A main problem here is that .concat() returns a new array. It doesn't add items onto the existing array.
You can change this:
issues.concat(JSON.parse(body).issues);
to this:
issues = issues.concat(JSON.parse(body).issues);
to make sure you are retaining the new concatenated array. This is a very common mistake.
You also potentially have sequencing issues, because the for loop starts a whole bunch of requests at the same time and the results may or may not arrive back in the order requested. You will still get the proper total number of issues, but they may not be in the requested order. I don't know whether that is a problem for you or not; if it is, it can also be fixed.
A more general question: is this approach the best way to go about
what I'm trying to achieve? I figured that even if my code did work,
the nature of asynchronicity means it's entirely possible for the
results to be concatenated in the wrong order.
Except for the ordering issue which can also be fixed, this is a reasonable way to do things. We would have to know more about your API to know if this is the most efficient way to use the API to get your results. Usually, you want to avoid making N repeated API calls to the same server and you'd rather make one API call to get all the results.
Should I just use a synchronous request library?
Absolutely not. Asynchronous programming is a learning step for most people coming to node.js, but it is how you get the best performance from it, so it should be learned and used.
Here's a way to collect all the results in reliable order using promises for synchronization and error propagation (which is hugely useful for async processing in node.js):
// promisify the request() function so it returns a promise
// whose fulfilled value is the request result
function requestP(url) {
    return new Promise(function (resolve, reject) {
        request(url, function (err, response, body) {
            if (err || response.statusCode !== 200) {
                reject({err: err, response: response});
            } else {
                resolve({response: response, body: body});
            }
        });
    });
}

Donedone.prototype.getAllActiveIssues = function () {
    var url = this.url;
    return requestP(url + `/issues/all_active.json?take=500`).then(function (results) {
        var data = JSON.parse(results.body);
        var totalIssues = data.total_issues;
        var promises = [];
        for (let i = 0; i < totalIssues; i += 500) {
            promises.push(requestP(url + `/issues/all_active.json?skip=${i}&take=500`).then(function (results) {
                return JSON.parse(results.body).issues;
            }));
        }
        return Promise.all(promises).then(function (results) {
            // results is an array of each chunk (which is itself an array),
            // so we have an array of arrays; now concat all results in order
            return Array.prototype.concat.apply([], results);
        });
    });
};

xxx.getAllActiveIssues().then(function (issues) {
    // process issues here
}, function (err) {
    // process error here
});
Hello,
I use Node.js to provide an API for storing data on a MongoDB database.
I ran multiple tests on a read method, which takes ids and returns the corresponding documents. The point is that I must return these documents in the specified order. To ensure that, I use the following code:
// Sequentially fetch every element
function read(ids, callback) {
    var i = 0;
    var results = [];
    function next() {
        db.findOne(ids[i], function (err, doc) {
            results.push(err ? null : doc);
            if (ids.length > ++i) {
                return next();
            }
            callback(results);
        });
    }
    next();
}
This way, documents are fetched one-by-one, in the right order. It takes about 11s on my laptop to retrieve 27k documents.
However, I thought that it was possible to improve this method:
// Asynchronously map the whole array
var async = require('async');

function read(ids, callback) {
    async.map(ids, db.findOne.bind(db), callback);
}
After running a single test, I was quite satisfied seeing that the 27k documents were retrieved in only 8s using simpler code.
The problem appears when I repeat the same request: the response time keeps growing, proportionally to the number of elements retrieved: 9s, 10s, 11s, 12s... This problem does not happen in the sequential version.
I tried two versions of Node.js, v6.2.0 and v0.10.29. The problem is the same. What causes this latency and how could I suppress it?
Try async.mapLimit to prevent overload. You'll need some testing to tune the limit value for your environment.
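For reference, here is a hand-rolled sketch of the idea behind async.mapLimit: run at most `limit` workers concurrently while keeping the results in input order (this is an illustration of the pattern, not the library's implementation):

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are stored at their input index, so order is preserved.
function mapLimit(items, limit, worker, done) {
    var results = new Array(items.length);
    var inFlight = 0, nextIndex = 0, finished = 0, failed = false;
    if (items.length === 0) return done(null, []);
    (function fill() {
        while (inFlight < limit && nextIndex < items.length) {
            (function (i) {
                inFlight++;
                worker(items[i], function (err, res) {
                    if (failed) return;
                    if (err) { failed = true; return done(err); }
                    results[i] = res;
                    inFlight--;
                    if (++finished === items.length) return done(null, results);
                    fill(); // a slot freed up -- start the next item
                });
            })(nextIndex++);
        }
    })();
}
```

With this you could call `mapLimit(ids, 10, db.findOne.bind(db), callback)` to cap the number of concurrent findOne calls.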
But find({_id: {$in: list}}) is always better, because it makes a single database request instead of many.
I suggest restoring the original order client-side.
Something like this:
function read(ids, cb) {
    db.find(
        {_id: {$in: ids.map(id => mongoose.Types.ObjectId(id))}},
        process
    );
    function process(err, docs) {
        if (err) return cb(err);
        return cb(null, docs.sort(ordering));
    }
    function ordering(a, b) {
        // documents whose ids appear earlier in `ids` sort first
        return ids.indexOf(a._id.toString()) - ids.indexOf(b._id.toString());
    }
}
The find query may need to be adjusted; I can't know which exact mongodb driver you use.
This code is a first try; more manual sorting can improve performance a lot, since [].indexOf is heavy too (O(n)).
But I'm almost sure that, even as it is now, it will work much faster.
Possible ordering replacement:
var idHash = {};
for (var i = 0; i < ids.length; i++) {
    idHash[ids[i]] = i;
}

function ordering(a, b) {
    return idHash[a._id.toString()] - idHash[b._id.toString()];
}
Any comparison sort is O(n log n) at best, but we already know the final position of each found document, so we can restore the original order in O(n):
var idHash = ids.reduce((c, id, i) => (c[id] = i, c), {});

function process(err, docs) {
    if (err) return cb(err);
    return cb(null,
        docs.reduce(
            (c, doc) => (c[idHash[doc._id.toString()]] = doc, c),
            ids.map(id => null))); // fill not-found docs with null
}
Functional style makes the code more flexible. For example, this code can easily be modified to use async.reduce so it blocks the event loop less.
I have an orientdb database. I want to use nodejs with RESTful calls to create a large number of records. I need to get the #rid of each for some later processing.
My pseudocode is:
for each record
    write.to.db(record)
    when the async of write.to.db() finishes
        process based on #rid
carryon()
I have landed in serious callback hell with this. The version that came closest used tail recursion in the .then() function to write the next record to the db. However, I couldn't carry on with the rest of the processing.
A final constraint is that I am behind a corporate proxy and cannot use any other packages without going through the network administrator, so using the native nodejs packages is essential.
Any suggestions?
With a completion callback, the general design pattern for this type of problem makes use of a local function for doing each write:
var records = ....; // array of records to write
var index = 0;

function writeNext(r) {
    write.to.db(r, function (err) {
        if (err) {
            // error handling
        } else {
            ++index;
            if (index < records.length) {
                writeNext(records[index]);
            }
        }
    });
}

writeNext(records[0]);
The key here is that you can't use synchronous iterators like .forEach(), because they won't wait for each async operation to complete before starting the next. Instead, you do your own iteration.
If your write function returns a promise, you can use the .reduce() pattern that is common for iterating an array.
var records = ...; // some array of records to write

records.reduce(function (p, r) {
    return p.then(function () {
        return write.to.db(r);
    });
}, Promise.resolve()).then(function () {
    // all done here
}, function (err) {
    // error here
});
This solution chains promises together, waiting for each one to resolve before executing the next save.
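Since the question also needs the #rid of each record, a small variant of the same reduce pattern can collect each resolved value as the chain advances. Here writeToDb is a hypothetical stand-in for your promise-returning write function:

```javascript
// Chain the writes in series and collect each resolved value (e.g. the #rid).
// `writeToDb(record)` is assumed to return a promise resolving to the rid.
function writeAll(records, writeToDb) {
    var rids = [];
    return records.reduce(function (p, r) {
        return p.then(function () {
            return writeToDb(r).then(function (rid) { rids.push(rid); });
        });
    }, Promise.resolve()).then(function () {
        return rids; // same order as records
    });
}
```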
It's kinda hard to tell which function would be best for your scenario w/o more detail, but I almost always use asyncjs for this kind of thing.
From what you say, one way to do it would be with async.map:
var recordsToCreate = [...];

function functionThatCallsTheApi(record, cb) {
    // do the api call, then call cb(null, rid)
}

async.map(recordsToCreate, functionThatCallsTheApi, function (err, results) {
    // here, err will be set if anything failed in any function
    // results will be an array of the rids
});
You can also check out the other functions to enable throttling, which is probably a good idea.
My input is streamed from another source, which makes it difficult to use async.forEach. I am pulling data from an API endpoint, but I have a limit of 1000 objects per request to the endpoint, and I need to get hundreds of thousands of them (basically all of them) and I will know they're finished when the response contains < 1000 objects. Now, I have tried this approach:
/* List all deposits */
var depositsAll = [];
var depositsIteration = [];
async.doWhilst(this._post(endpoint_path, function (err, response) {
    // check err
    /* Loop through the data and gather only the deposits */
    for (var key in response) {
        // do some stuff
    }
    depositsAll += depositsIteration;
    return callback(null, depositsAll);
}, {limit: 1000, offset: 0, sort: 'desc'}),
response.length > 1000, function (err, depositsAll) {
    // check for err
    // return the complete result
    return callback(null, depositsAll);
});
With this code I get an internal async error saying that the iterator is not a function. But in general I am almost sure the logic is not correct either.
If it's not clear what I'm trying to achieve - I need to perform a request multiple times, and add the response data to a result that at the end contains all the results, so I can return it. And I need to perform requests until the response contains less than 1000 objects.
I also looked into async.queue but could not get the hang of it...
Any ideas?
You should be able to do it like that, but if that example is from your real code, you have misunderstood some of how async works. doWhilst takes three arguments, each of them being a function:
The function to be called by async. It receives a callback argument that must be called. In your case, you need to wrap this._post inside another function.
The test function. (You were passing the value of response.length > 1000, i.e. a boolean, and that only if response were defined.)
The final function to be called once execution has stopped.
Example with each needed function separated for readability:
var depositsAll = [];
var responseLength = 1000;
var self = this;

var post = function (asyncCb) {
    self._post(endpoint_path, function (err, res) {
        ...
        responseLength = res.length;
        asyncCb(err, depositsAll);
    });
};

var check = function () {
    return responseLength >= 1000;
};

var done = function (err, deposits) {
    console.log(deposits);
};

async.doWhilst(post, check, done);
I am creating an array of JSON objects which is then stored in mongodb.
Each JSON object contains a number of fields - each being populated before I save the object to mongodb.
Some of the objects' attributes are populated by making API calls to other websites such as last.fm, but the values aren't returned quickly enough to populate the attributes before the object is saved to mongodb.
How can I wait for all attributes of an object to be populated before saving it? I did try async.waterfall but it still falls through without waiting, and I end up with a database full of documents with empty fields.
Any help would be greatly appreciated.
Thanks :)
You have a few options for controlling asynchrony in JavaScript:
Callback pattern: (http://npmjs.org/async) async.all([...], function (err) {
Promises: (http://npmjs.org/q) Q.all([...]).then(function () {
Streams: (http://npmjs.org/concat-stream) see also https://github.com/substack/stream-handbook
Since you say you are making multiple API calls to other websites, you may want to try:
async.each(api_requests,
    function (api_request, cb) {
        request(api_request, function (error, response, body) {
            /* code */
            /* add to model for Mongo */
            cb();
        });
    },
    function (err) {
        // continue execution after all cbs are received
        /* code */
        /* save to Mongo, etc. */
    }
);
The above example is most applicable when you are making numerous requests that follow the same format. Review the documentation for waterfall (https://github.com/caolan/async#waterfall) if the input to your next step depends on the output of the previous one, or parallel (https://github.com/caolan/async#parallel) if you have a bunch of unrelated tasks that don't rely on each other. The great thing about async is that you can nest and string all these functions together to support what you're trying to do.
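The same "continue only after every field is populated" idea can also be expressed with native Promise.all, with no extra dependency. In this sketch, fetchAttribute and the field names are hypothetical stand-ins for your external API calls (e.g. the last.fm lookup):

```javascript
// fetchAttribute stands in for one external API call per attribute.
function fetchAttribute(name) {
    return new Promise(function (resolve) {
        setImmediate(function () { resolve(name.toUpperCase()); });
    });
}

// Resolve every attribute first, then assemble the document.
function buildDocument(attributeNames) {
    return Promise.all(attributeNames.map(fetchAttribute)).then(function (values) {
        var doc = {};
        attributeNames.forEach(function (name, i) { doc[name] = values[i]; });
        return doc; // every field is populated -- now it's safe to save to Mongo
    });
}
```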
You'll either want to use promises or some sort of callback mechanism. Here's an example of the promise method with jPromise:
var jPromise = require('jPromise');

var promises = [];
for (var i = 0; i < 10; i++) {
    promises.push(someAsyncApiCall(i));
}

jPromise.when(promises).then(function () {
    saveThingsToTheDb();
});
Similarly, without the promise library:
var finished = 0;
var toDo = 10;

function allDone() {
    saveThingsToTheDb();
}

for (var i = 0; i < toDo; i++) {
    someAsyncApiCall(function () {
        finished++;
        if (finished === toDo) {
            allDone();
        }
    });
}
Personally, I prefer the promise method, but that will only work well if the API you're calling returns some sort of promise. If it doesn't, you'll have to wrap the callback API with promises somehow (Q does this pretty well).
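Wrapping a node-style callback API with native promises is only a few lines; this is a minimal hand-rolled sketch of the idea (libraries like Q do the same with more edge-case handling):

```javascript
// Turn fn(args..., callback(err, result)) into a promise-returning function.
function promisify(fn) {
    return function () {
        var args = Array.prototype.slice.call(arguments);
        return new Promise(function (resolve, reject) {
            fn.apply(null, args.concat(function (err, result) {
                if (err) reject(err);
                else resolve(result);
            }));
        });
    };
}

// e.g. var readFileP = promisify(require('fs').readFile);
```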