I have an orchestration that takes 100 search terms, batches them into groups of 10, and fans out to start the search activities (each activity takes 10 names).
A search activity processes each name sequentially. For each name, it makes 2 search requests to Azure Search: one with spaces and punctuation and another without. To make the search request I call the Azure Search REST API.
The orchestration waits for all the search activities to resolve and return the result.
The issue I am facing is that the round trip for the Azure Search HTTP request takes too long in the function app when deployed on Azure.
At the start of the search, each request takes 3-4 seconds. But after a few requests, the time for a single request goes up to 17-20 seconds.
Locally, when I run this orchestration with the same input against the same Azure Search index, no request takes more than 1.5-2 seconds, and the orchestration completes in 1.0-1.2 minutes. The deployed app takes 7-8 minutes for the same input and the same index.
The following is how I make the request (code for the search activity function):
const request = require('request');

const requestDefault = request.defaults({
    method: 'GET',
    gzip: true,
    json: true,
    timeout: `some value`,
    time: true,
    pool: { maxSockets: 100 }
});
module.exports = async function (context, names) {
    let results = [];
    for (let i = 0; i < names.length; i++) {
        results.push(await search(context, names[i]));
        results.push(await search(context, withOutSpaceAndPunctuations(names[i])));
    }
    return results;
}
function search(context, name) {
    let url = createAzureSearchUrl(name);
    return new Promise((resolve, reject) => {
        requestDefault({
            uri: url,
            headers: { 'api-key': `key` }
        }, function (error, response, body) {
            if (!error) {
                context.log(`round trip time => ${response.elapsedTime / 1000} sec`);
                context.log(`elapsed-time for search => ${response.headers['elapsed-time']} ms`);
                resolve(body.value);
            } else {
                reject(new Error(error));
            }
        });
    });
}

function createAzureSearchUrl(name) {
    return `azure search url`;
}
The Orchestration
const df = require("durable-functions");

module.exports = df.orchestrator(function* (context) {
    let names = context.bindings.context.input;
    let chunk = 10;

    // split the 100 names into batches of 10
    let batches = [];
    for (let i = 0; i < names.length; i += chunk) {
        batches.push(names.slice(i, i + chunk));
    }

    // fan out: one Search activity per batch
    const tasks = [];
    for (let i = 0; i < batches.length; i++) {
        tasks.push(context.df.callActivity("Search", batches[i]));
    }

    // fan in: wait for all activities to finish
    let searchResults = yield context.df.Task.all(tasks);
    return searchResults;
});
The elapsed-time for search is always less than 500 milliseconds.
Following this documentation, I removed the request module and used the native https module instead, but it made no improvement.
var https = require('https');
const { performance } = require('perf_hooks');

https.globalAgent.maxSockets = 100;

function searchV2(context, name) {
    let url = createAzureSearchUrl(name);
    const t0 = performance.now();
    return new Promise((resolve, reject) => {
        let options = { headers: { 'api-key': 'key' } };
        https.get(url, options, (res) => {
            const t1 = performance.now();
            context.log(`round trip time => ${(t1 - t0) / 1000} sec`);
            context.log(`elapsed-time => ${res.headers['elapsed-time']}`);
            let body = '';
            res.on('data', (d) => { body += d; });
            res.on('end', () => resolve(body));
        }).on('error', reject);
    });
}
For testing, I changed the batch size from 10 to 100 so that a single search activity processes all 100 search terms sequentially. In that case every request to Azure Search took 3.0-3.5 seconds. But 3.5 sec * 200 requests (100 names, 2 requests each) is roughly 11.7 minutes, so not fanning out is not an option.
The deployed app originally ran with an instance count of 1. I scaled it to 6 instances; with 6 instances a single request now takes 3.5-7.5 seconds, and the total time for 100 search terms drops to 4.0-4.3 minutes. Increasing the instance count to 6 brought quite a lot of improvement, but a lot of requests still take 7.5 seconds. The maxConcurrentActivityFunctions parameter was set to 6 in the host file.
I then updated the instance count to 10 and maxConcurrentActivityFunctions to 10 as well, but it still takes 4.0-4.3 minutes for 100 search terms. No improvement, and I saw a lot of requests taking up to 10 seconds.
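For reference, the concurrency setting lives in host.json; this is roughly what the relevant section looks like after that change (on a Functions v2 host it sits nested under an extensions block rather than at the top level):
{
    "durableTask": {
        "maxConcurrentActivityFunctions": 10
    }
}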
I do not think it is a code-level issue. It has something to do with fanning out and making multiple concurrent requests from the same function app.
Why is this happening in the deployed app and not locally? What should I do to decrease the request latency? Any suggestions will be appreciated.
My function app runs on an Azure App Service plan.
My DurableTask version is 1.7.1
The latency also increases when indexing is happening in parallel. Is that the case for you? The elapsed-time header for the query may not take that latency into account.
On the Azure portal, when you navigate to your search resource and go to the monitoring tab, you should be able to see the latency, the number of queries, and the percentage of throttled queries. That should provide some direction. What tier is your search service on? How many partitions and replicas have you provisioned for your search service?
As a test, you can increase the number of replicas and partitions to see if that helps with your performance. It did for me.
My issues
Call an online API 1000+ times; the API limits the number of calls to 10 per second.
Wait for all the API calls to give back a result (or retry); it can take 5 seconds before the API sends its data.
Use the combined data in the rest of my app
What I have tried, after looking at a lot of different questions and answers here on the site:
Use a promise to wait for a single API request:
const https = require("https");

function myRequest(param) {
    const options = {
        host: "api.xxx.io",
        port: 443,
        path: "/custom/path/" + param,
        method: "GET"
    };
    return new Promise(function(resolve, reject) {
        const req = https.request(options, function(result) {
            let str = "";
            result.on('data', function(chunk) { str += chunk; });
            result.on('end', function() { resolve(JSON.parse(str)); });
            result.on('error', function(err) { reject(err); });
        });
        req.on('error', reject); // reject on connection-level errors too
        req.end();
    });
}
Use Promise.all to do all the requests and wait for them to finish
const params = [{item: "param0"}, ... , {item: "param1000+"}]; // imagine 1000+ items
const promises = [];
params.forEach(function(param) {
    promises.push(myRequest(param.item));
});

const result = Promise.all(promises).then(function(data) {
    // doing some funky stuff with data
});
So far so good, sort of.
It works when I limit the number of API requests to a maximum of 10; beyond that, the API's rate limiter kicks in. When I console.log(promises), it gives back an array of 'request'.
I have tried to add setTimeout in different places, like:
...
params.forEach(function(param) {
    promises.push(setTimeout(function() {
        myRequest(param.item);
    }, 100));
});
...
But that does not seem to work. When I console.log(promises), it gives back an array of 'function'
My questions
Now I am stuck ... any ideas?
How do I build in retries when the API gives back an error?
Thank you for reading up to here, you are already a hero in my book!
When you have a complicated control flow, using async/await helps a lot to clarify the logic of the flow.
Let's start with the following simple algorithm to limit everything to 10 requests per second:
make 10 requests
wait 1 second
repeat until no more requests
For this the following simple implementation will work:
async function rateLimitedRequests(params) {
    let results = [];
    while (params.length > 0) {
        let batch = [];
        for (let i = 0; i < 10; i++) {
            // use shift() instead of pop() if you want to process in the original order
            let thisParam = params.pop();
            if (thisParam) {
                batch.push(myRequest(thisParam.item));
            }
        }
        results = results.concat(await Promise.all(batch));
        await delayOneSecond();
    }
    return results;
}
Now we just need to implement the one second delay. We can simply promisify setTimeout for this:
function delayOneSecond() {
    return new Promise(ok => setTimeout(ok, 1000));
}
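A hypothetical call site, reusing the params array from your question (note that this version consumes the array with pop(), so pass a copy if you still need the original):
rateLimitedRequests(params.slice())
    .then(results => console.log(`fetched ${results.length} results`))
    .catch(err => console.error(err));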
This will definitely give you a rate limiter of at most 10 requests per second. In fact it performs somewhat slower than that, because each batch executes in request time plus one second. This is perfectly fine and already meets your original intent, but we can improve it to squeeze in a few more requests and get as close as possible to exactly 10 requests per second.
We can try the following algorithm:
remember the start time
make 10 requests
compare end time with start time
delay one second minus request time
repeat until no more requests
Again, we can use almost exactly the same logic as the simple code above but just tweak it to do time calculations:
const ONE_SECOND = 1000;

async function rateLimitedRequests(params) {
    let results = [];
    while (params.length > 0) {
        let batch = [];
        let startTime = Date.now();
        for (let i = 0; i < 10; i++) {
            let thisParam = params.pop();
            if (thisParam) {
                batch.push(myRequest(thisParam.item));
            }
        }
        results = results.concat(await Promise.all(batch));
        let endTime = Date.now();
        let requestTime = endTime - startTime;
        let delayTime = ONE_SECOND - requestTime;
        if (delayTime > 0) {
            await delay(delayTime);
        }
    }
    return results;
}
Now, instead of hardcoding the one-second delay, we can write a delay function that accepts a delay period in milliseconds:
function delay(milliseconds) {
    return new Promise(ok => setTimeout(ok, milliseconds));
}
We now have a simple, easy-to-understand function that rate-limits as close as possible to 10 requests per second. It is rather bursty, in that it makes 10 parallel requests at the beginning of each one-second period, but it works. We could of course implement more complicated algorithms to smooth out the request pattern, but I leave that to your creativity and as homework for the reader.
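You also asked about retries. One hedged way to add them is a small wrapper around myRequest; the retry count and backoff below are arbitrary choices, not something your API requires:
async function myRequestWithRetry(param, retries = 3) {
    for (let attempt = 1; attempt <= retries; attempt++) {
        try {
            return await myRequest(param);
        } catch (err) {
            if (attempt === retries) throw err;
            await delay(500 * attempt); // simple linear backoff between attempts
        }
    }
}
Swapping myRequest for myRequestWithRetry inside rateLimitedRequests leaves the batching logic untouched.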
My app can issue a large number of writes, reads and updates (it can even go above 10,000) under certain circumstances.
While developing the application locally, these operations usually take a few seconds at most (great!). However, they can easily take minutes when running on Google Cloud, to the point that the Firebase function times out.
I developed a controlled test in a separate project whose sole purpose is to write, get and delete thousands of items for benchmarking. These were the results (averaged over several tests):
Local Emulator:
5000 items, 4.2s write, 2.2s delete
5000 items, batch mode ON, 0.75s write, 0.11s delete
Cloud Firestore:
100 items, 15.8s write, 14.5s delete
1000 items, batch mode ON, 4.8s write, 3.0s delete
5000 items, async mode ON, 10.2s write, 8.0s delete
5000 items, batch & async ON, 4.5s write, 3.9s delete
NOTE: My local emulator crashes whenever I try to perform db operations asynchronously (which is a problem for another day), which is why I was unable to test the async write/delete speeds locally. Also, write and read values usually vary +-25% between runs.
However, as you can see, the fact that my local emulator in its slowest mode is faster than the fastest test in the cloud definitely raises some questions.
Could it be that I have some sort of configuration issue, or are these numbers simply standard for Firestore? Here is the (summarised) TypeScript code if you wish to try it:
functions.runWith({ timeoutSeconds: 540, memory: "2GB" }).https.onRequest(async (req, res) => {
    // getting the settings from the request
    var data = req.body;
    var numWrites: number = data.numWrites;
    var syncMode: boolean = !data.asyncMode;
    var batchMode: boolean = data.batchMode;
    var batchLimit: number = data.batchLimit;

    // pre-run setup
    var dbObj = {
        number: 123,
        string: "abc",
        boolean: true,
        object: { var1: "var1", num1: 1 },
        array: [1, 2, 3, 4]
    };
    var collection = db.collection("testCollection");
    var startTime = moment();

    // insert requested number of items, using requested settings
    var allInserts: Promise<any>[] = [];
    if (!batchMode) { // sequential writes
        for (var i = 0; i < numWrites; i++) {
            var set = collection.doc().set(dbObj);
            allInserts.push(set);
            if (syncMode) await set;
        }
    } else { // batch writes
        var batch = db.batch();
        for (var i = 1; i <= numWrites; i++) {
            batch.set(collection.doc(), dbObj);
            if (i % batchLimit === 0) {
                var commit = batch.commit();
                allInserts.push(commit);
                batch = db.batch();
                if (syncMode) await commit;
            }
        }
    }

    // some logging information. Getting items to delete
    var numInserts = allInserts.length;
    await Promise.all(allInserts);
    var insertTime = moment();
    var alldocs = (await collection.get()).docs;
    var numDocs = alldocs.length;
    var getTime = moment();

    // deletes all of the items in the collection
    var allDeletes: Promise<any>[] = [];
    if (!batchMode) { // sequential deletes
        for (var doc of alldocs) {
            var del = doc.ref.delete();
            allDeletes.push(del);
            if (syncMode) await del;
        }
    } else { // batch deletes
        var batch = db.batch();
        for (var i = 1; i <= numDocs; i++) {
            var doc = alldocs[i - 1];
            batch.delete(doc.ref);
            if (i % batchLimit === 0) {
                var commit = batch.commit();
                allDeletes.push(commit);
                batch = db.batch();
                if (syncMode) await commit;
            }
        }
    }
    var numDeletes = allDeletes.length;
    await Promise.all(allDeletes);
    var deleteTime = moment();

    res.status(200).send(/* a whole bunch of metrics for analysis */);
});
EDIT: just to clarify, the UI does not perform these write operations, so latency between the end-user machine and the cloud servers should (in theory) not cause any major latency issues. The communication with the database is handled fully by Firebase Functions.
EDIT 2: I have run this test on two deployments, one in Europe and another in the US. Both took around the same amount of time to run, even though my ping to these two servers is vastly different.
It is normal to get faster responses from the local emulator than from Cloud Firestore, as the remote environment adds network traffic that takes time.
For large numbers of operations from a single source, the recommendation is to use batch operations, as these reduce the number of transactions and, with them, the round trips.
And the reason async mode is faster is that the caller is not waiting for each transaction to complete before sending the next one, so it also makes sense that the calls are faster with it.
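Condensed from the code in the question, the combination this points to (batch writes whose commits are fired without awaiting each one, inside the async request handler) looks roughly like this; db, collection, dbObj and numWrites stand in for the question's variables:
const BATCH_LIMIT = 500; // Firestore caps a batched write at 500 operations

const commits = [];
let batch = db.batch();
for (let i = 1; i <= numWrites; i++) {
    batch.set(collection.doc(), dbObj);
    if (i % BATCH_LIMIT === 0) {
        commits.push(batch.commit()); // fire the commit without awaiting it
        batch = db.batch();
    }
}
commits.push(batch.commit()); // commit whatever is left over
await Promise.all(commits);   // wait for all commits at the end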
The times you have in the table seem normal to me.
As an additional optimization, make sure that the region where your Firestore database is located is the closest one to your location.
My application makes about 50 redis.get calls to serve a single HTTP request. It serves millions of requests daily and runs on about 30 pods.
When monitoring with New Relic I see an average redis.get time of 200 ms. To optimize this I wrote a simple pipelining system in Node.js, which is just a wrapper over redis.get: it pushes all the requests into a queue and then executes the queue using redis.mget (getting all the keys in bulk).
Following is the code snippet:
class RedisBulk {
    constructor() {
        this.queue = [];
        this.processingQueue = {};
        this.intervalId = setInterval(() => {
            this._processQueue();
        }, 5);
    }

    clear() {
        clearInterval(this.intervalId);
    }

    get(key, cb) {
        this.queue.push({cb, key});
    }

    _processQueue() {
        if (this.queue.length > 0) {
            let queueLength = this.queue.length;
            logger.debug('Processing Queue of length', queueLength);
            let time = (new Date).getTime();
            this.processingQueue[time] = this.queue;
            this.queue = []; // empty the queue
            let keys = [];
            this.processingQueue[time].forEach((item) => {
                keys.push(item.key);
            });
            global.redisClient.mget(keys, (err, replies) => {
                if (err) {
                    captureException(err);
                    console.error(err);
                    // propagate the error to every queued callback
                    this.processingQueue[time].forEach((item) => {
                        item.cb(err);
                    });
                } else {
                    this.processingQueue[time].forEach((item, index) => {
                        item.cb(err, replies[index]);
                    });
                }
                delete this.processingQueue[time];
            });
        }
    }
}
let redis_bulk = new RedisBulk();

redis_bulk.get('a', (err, value) => { /* use value */ });
redis_bulk.get('b', (err, value) => { /* use value */ });
redis_bulk.get('c', (err, value) => { /* use value */ });
redis_bulk.get('d', (err, value) => { /* use value */ });
My question is: is this a good approach? Will it help in optimizing the redis get time? Is there any other solution to the above problem?
Thanks
I'm not a redis expert, but judging by the documentation:
MGET has a time complexity of O(N), where N is the number of keys to retrieve.
GET has a time complexity of O(1).
Which brings both scenarios to the same end result in terms of time complexity in your scenario. A bulk request with MGET can bring you some improvement on the IO side, but apart from that it looks like you have the same bottleneck.
I'd ideally split the data into chunks and respond via multiple HTTP requests in async fashion, if that's an option.
Alternatively, you can try calling GET with Promise.all() to run the GET requests in parallel, for all the GET calls you need.
Something like:
const asyncRedis = require("async-redis");
const client = asyncRedis.createClient();

function bulk(keys) {
    // map each key to a GET promise; the arrow function keeps the client's `this` binding intact
    return Promise.all(keys.map((key) => client.get(key)));
}
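A hypothetical call site (the key names are placeholders for whatever keys a given request needs):
bulk(['key1', 'key2', 'key3']).then((values) => {
    // values arrive in the same order as the keys
});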
I am working on a Node.js application which uses the WordPress JSON API as a kind of headless CMS. When the application spins up, we query out to the WP database and pull in the information we need (using Axios), manipulate it, and store it temporarily.
Simple enough, but one of our post categories in the CMS has a rather large number of entries. For some godforsaken reason, WordPress caps API requests at a maximum of 99 posts at a time, and requires that we write a loop that can send concurrent API requests until all the data has been pulled.
For instance, if we have 250 posts of a given type, I need to hit that route three separate times, specifying which "page" of data I want each time.
Per the docs, https://developer.wordpress.org/rest-api/using-the-rest-api/pagination/, I have access to a ?page= query string that I can use to send these requests concurrently. (i.e. ...&page=2)
I also have access to X-WP-Total in the headers object, which gives me the total number of posts within the given category.
However, these API calls are part of a nested promise chain, and the whole process needs to return a promise I can continue chaining off of.
The idea is to make it dynamic so it will always pull all of the data, and return it as one giant array of posts. Here's what I have, which is functional:
const request = require('axios');

module.exports = (request_url) => new Promise((resolve, reject) => {

    // START WITH SMALL ARBITRARY REQUEST TO GET TOTAL NUMBER OF POSTS FAST
    request.get(request_url + '&per_page=1').then(
        (apiData) => {

            // SETUP FOR PROMISE.ALL()
            let promiseArray = [];

            // COMPUTE HOW MANY REQUESTS WE NEED
            // ALWAYS ROUND TOTAL NUMBER OF PAGES UP TO GET ALL THE DATA
            const totalPages = Math.ceil(apiData.headers['x-wp-total'] / 99);

            for (let i = 1; i <= totalPages; i++) {
                promiseArray.push(request.get(`${request_url}&per_page=99&page=${i}`));
            }

            resolve(
                Promise.all(promiseArray)
                    .then((resolvedArray) => {

                        // PUSH IT ALL INTO A SINGLE ARRAY
                        let compiledPosts = [];
                        resolvedArray.forEach((axios_response) => {

                            // AXIOS MAKES US ACCESS W/RES.DATA
                            axios_response.data.forEach((post) => {
                                compiledPosts.push(post);
                            });
                        });

                        // RETURN AN ARRAY OF ALL POSTS REGARDLESS OF LENGTH
                        return compiledPosts;
                    }).catch((e) => { console.log('ERROR'); reject(e); })
            );
        }
    ).catch((e) => { console.log('ERROR'); reject(e); });
});
Any creative ideas to make this pattern better?
I have exactly the same question. In my case, I use Vue Resource:
this.$resource('wp/v2/media').query().then((response) => {
    let pagesNumber = Math.ceil(response.headers.get('X-WP-TotalPages'));
    for (let i = 1; i <= pagesNumber; i++) {
        this.$resource('wp/v2/media?page=' + i).query().then((response) => {
            this.medias.push(response.data);
            this.medias = _.flatten(this.medias);
            console.log(this.medias);
        });
    }
});
I'm pretty sure there is a better workaround to achieve this.
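One hedged way to flatten the original axios version from the question, avoiding the extra new Promise wrapper around an existing promise chain, is an async/await refactor; request_url and the per_page/page query-string conventions are taken from the question:
const request = require('axios');

module.exports = async (request_url) => {
    // small probe request just to read the X-WP-Total header
    const probe = await request.get(`${request_url}&per_page=1`);
    const totalPages = Math.ceil(probe.headers['x-wp-total'] / 99);

    // fire all page requests concurrently and wait for them all
    const pageRequests = [];
    for (let page = 1; page <= totalPages; page++) {
        pageRequests.push(request.get(`${request_url}&per_page=99&page=${page}`));
    }
    const responses = await Promise.all(pageRequests);

    // flatten every page's posts into one array
    return responses.flatMap((res) => res.data);
};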
I have an external API that rate-limits requests to at most 25 requests per second. I want to insert parts of the results into a MongoDB database.
How can I rate-limit the request function so that I don't miss any API results for any element of the array?
var MongoClient = require('mongodb').MongoClient;
var request = require('request');

MongoClient.connect('mongodb://127.0.0.1:27017/test', function (err, db) {
    if (err) {
        throw err;
    } else {
        for (var i = 0; i < arr.length; i++) {
            // need to rate limit the following function, without missing any value in the arr array
            request({
                method: 'GET',
                url: 'https://SOME_API/json?address=' + arr[i]
            },
            function (error, response, body) {
                // doing computation, including inserting to mongo
            });
        }
    }
});
This could possibly be done using the request-rate-limiter package, so you can add this to your code:
var RateLimiter = require('request-rate-limiter');
const REQS_PER_MIN = 25 * 60; // that's 25 per second
var limiter = new RateLimiter(REQS_PER_MIN);
and since request-rate-limiter is based on request, you can just replace request with limiter.request.
You can find further information on the package's npm page - https://www.npmjs.com/package/request-rate-limiter
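Taking that statement at face value (an assumption about the package version exposing a request-compatible limiter.request), the loop from the question would become roughly:
var RateLimiter = require('request-rate-limiter');

const REQS_PER_MIN = 25 * 60; // 25 per second
var limiter = new RateLimiter(REQS_PER_MIN);

for (var i = 0; i < arr.length; i++) {
    limiter.request({
        method: 'GET',
        url: 'https://SOME_API/json?address=' + arr[i]
    }, function (error, response, body) {
        // same computation / mongo insert as before
    });
}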
On a personal note - I'd replace all these callbacks with promises
You need to combine 2 things.
A throttling mechanism. I suggest _.throttle from the lodash project. This can do the rate limiting for you.
You also need an async control flow mechanism to make sure the requests run in series (don't start second one until first one is done). For that I suggest async.eachSeries
Both of these changes will be cleaner if you refactor your code to this signature:
function scrape(address, callback) {
    // code to fetch a single address, do computation, and save to mongo here
    // invoke the callback with (error, result) when done
}
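For completeness, a hedged sketch of how the refactored scrape could be driven with async.eachSeries; the arr array comes from the question, and the fixed per-request pause is an arbitrary stand-in for the throttling step:
const async = require('async');

// arr is the array of addresses from the question
async.eachSeries(arr, function (address, done) {
    scrape(address, function (err, result) {
        if (err) return done(err);
        // a small fixed pause keeps the series comfortably under 25 requests per second
        setTimeout(done, 50);
    });
}, function (err) {
    if (err) console.error('scrape failed:', err);
    else console.log('all addresses processed');
});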