How to send a large number of requests with axios? - node.js

I want to make about 2000 API calls to retrieve data from some API. How do I achieve this with axios without running into bottlenecks?

2000 API calls in parallel, or one after another?
Doing them all in parallel would be really resource intensive; depending on your device it may or may not be possible. Doing them one after another, however, can be easily achieved with a simple loop.
Here is what I would do for a parallel implementation. It takes a little trial and error in the beginning to find the best parallelism factor.
Use lodash's _.chunk to split the full set of 2000 request objects into chunks of size x
(reference: https://www.geeksforgeeks.org/lodash-_-chunk-method/)
Now, for each chunk, we make the API calls in parallel. That is, the chunks are processed in sequential order, but within each chunk the API calls run in parallel.
Sample code:
// Requiring the lodash and axios modules in the script
const _ = require("lodash");
const axios = require("axios");

async function _2000API_CALLS() {
    const reqs = [/* ... your 2000 request params ... */];
    // Making chunks of size 100
    const chunks = _.chunk(reqs, 100); // x = 100
    for (const c of chunks) {
        const responses = await Promise.all(c.map((req) => {
            return axios(req); // api call logic
        }));
        // accumulate responses in some array if you need to store them for later
    }
}
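One caveat worth noting: Promise.all rejects as soon as any call in a chunk fails, so you lose the other responses from that chunk. If partial failures are expected, a variation on the same idea using Promise.allSettled (available since Node 12.9) keeps every outcome. A minimal sketch, assuming the same array of request params as above:

const _ = require("lodash");
const axios = require("axios");

async function callAllInChunks(reqs) {
    const results = [];
    for (const c of _.chunk(reqs, 100)) {
        // allSettled never rejects; each entry is { status: "fulfilled", value }
        // or { status: "rejected", reason }
        const settled = await Promise.allSettled(c.map((req) => axios(req)));
        results.push(...settled);
    }
    return results;
}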

Related

Pass large array of objects to RabbitMQ exchange

I receive a large array of objects from an external source (more than 10,000 objects), and then I pass it to an exchange in order to notify other microservices about new entries to handle.
this._rmqClient.publishToExchange({
    exchange: 'my-exchange',
    exchangeOptions: {
        type: 'fanout',
        durable: true,
    },
    data: myData, // [object1, object2, object3, ...]
    pattern: 'myPattern',
})
The problem is that it's bad practice to push such a large message to an exchange, and I'd like to resolve this issue. I've read articles and Stack Overflow posts looking for code examples or information about streaming the data, but with no success.
The only way I've found is to divide my large array into chunks and publish each one to the exchange in a for loop. Is that good practice? How do I determine how many objects each chunk should contain? Or is there another approach?
It really depends on the object size, and that's something you'll have to figure out yourself. Take your 10k objects and calculate their average size (dump them as JSON into a file and divide the file size by 10,000). A request body size of around 50-100 kB is probably a good target, but that's still up to you.
Start with a chunk size of 50 and run tests: check the time taken, the bandwidth, and whatever else makes sense. Vary the chunk size between 1 and 5000 and test, test, test. At some point you'll get a feeling for which number is good to take.
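For instance, the average-size measurement can be done in code rather than via a file; a quick sketch, where objects stands for your 10k-element array:

// Rough average serialized size per object, in bytes
const totalBytes = Buffer.byteLength(JSON.stringify(objects));
const avgBytes = totalBytes / objects.length;
// e.g. aiming for ~75 kB per message:
const chunkSize = Math.max(1, Math.floor(75 * 1024 / avgBytes));
console.log(`avg object size: ${avgBytes.toFixed(1)} B, chunk size: ${chunkSize}`);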
Here's some example code of looping through the elements:
// send function for showcasing the idea
function send(data) {
    return this._rmqClient.publishToExchange({
        exchange: 'my-exchange',
        exchangeOptions: {
            type: 'fanout',
            durable: true,
        },
        data: data,
        pattern: 'myPattern',
    })
}

// this sends chunks one by one
async function sendLargeDataPacket(data, chunkSize) {
    // copy first, so we splice our own array and not the caller's
    const mutated = [...data]
    // send full packets as long as possible
    while (mutated.length >= chunkSize) {
        // send a packet of chunkSize length
        await send(mutated.splice(0, chunkSize))
    }
    // send the remaining elements if there are any
    if (mutated.length > 0) {
        await send(mutated)
    }
}
And you would call it like:
// that's your 10k+ items array!
var myData = [/**...**/]
// let's start with 50, but try out all numbers!
const chunkSize = 50
sendLargeDataPacket(myData, chunkSize).then(() => console.log('done')).catch(console.error)
This approach sends one packet after the other, and it may take some time since nothing is done in parallel. I don't know your requirements, but I can help you write a parallel approach if you need one.
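For what it's worth, a parallel variant could look something like the sketch below. It publishes a batch of chunks concurrently and waits for the whole batch before starting the next; it reuses the send function from above and assumes your broker can absorb the bursts:

const _ = require('lodash')

async function sendLargeDataPacketParallel(data, chunkSize, batchSize) {
    const chunks = _.chunk(data, chunkSize)
    // process batchSize chunks at a time, each batch fully in parallel
    for (const batch of _.chunk(chunks, batchSize)) {
        await Promise.all(batch.map((chunk) => send(chunk)))
    }
}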

Any suggestions about how to publish a huge amount of messages within one round of request / response?

If I publish 50K messages using Promise.all like below:
const pubsub = new PubSub({ projectId: PUBSUB_PROJECT_ID });
const topic = pubsub.topic(topicName, {
    batching: {
        maxMessages: 1000,
        maxMilliseconds: 100,
    },
});

const n = 50 * 1000;
const dataBufs: Buffer[] = [];
for (let i = 0; i < n; i++) {
    const data = `message payload ${i}`;
    const dataBuffer = Buffer.from(data);
    dataBufs.push(dataBuffer);
}
const tasks = dataBufs.map((d, idx) =>
    topic.publish(d).then((messageId) => {
        console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${idx}`);
    })
);
// publish messages concurrently
await Promise.all(tasks);
// send response to front-end
res.json(data);
I hit this issue: the pubsub-emulator throws an error and the publisher throws "Retry total timeout exceeded before any response was received" when publishing 50k messages.
If I use a for loop with async/await instead, the issue goes away:
const n = 50 * 1000;
for (let i = 0; i < n; i++) {
    const data = `message payload ${i}`;
    const dataBuffer = Buffer.from(data);
    const messageId = await topic.publish(dataBuffer);
    console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${i}`);
}
// some logic ...
// send response to front-end
res.json(data);
But it blocks the execution of the subsequent logic until all messages have been published, and it takes a long time to publish 50k messages this way.
Any suggestions on how to publish a huge number of messages (about 50k) without blocking the execution of the subsequent logic? Do I need to use child_process or some queue like bull to publish them in the background, without blocking the request/response workflow of the API? That is, I need to respond to the front-end as soon as possible; the 50k messages should be a background task.
It seems there is an in-memory queue inside the @google/pubsub library. I am not sure whether I should put another queue like bull on top of it.
The time it will take to publish large amounts of data depends on a lot of factors:
Message size. The larger the messages, the longer it takes to send them.
Network capacity (both of the connection between wherever the publisher is running and Google Cloud and, if relevant, of the virtual machine itself). This puts an upper bound on the amount of data that can be transmitted. It is not atypical to see smaller virtual machines with limits in the 40MB/s range. Note that if you are testing via Wifi, the limits could be even lower than this.
Number of threads and number of CPU cores. When having to run a lot of asynchronous callbacks, the ability to schedule them to run can be limited by the parallel capacity of the machine or runtime environment.
Typically, it is not good to try to send 50,000 publishes simultaneously from one instance of a publisher. It is likely that the above factors will cause the client to get overloaded and result in deadline exceeded errors. The best way to prevent this is to limit the number of messages that can be outstanding for publish at one time. Some of the libraries like Java support this natively. The Node.js library does not yet support this feature, but likely will in the future.
In the meantime, you'd want to keep a counter of the number of messages outstanding and limit it to whatever the client seems to be able to handle. Start with 1000 and work up or down from there based on the results. A semaphore would be a pretty standard way to achieve this behavior. In your case the code would look something like this:
var sem = require('semaphore')(1000);

// Wrap each publish in a promise that resolves only once the publish
// itself completes, so a single Promise.all is enough to await everything.
const tasks = dataBufs.map((d, idx) =>
    new Promise((resolve, reject) => {
        sem.take(function() {
            topic.publish(d)
                .then((messageId) => {
                    console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${idx}`);
                    resolve(messageId);
                }, reject)
                .finally(() => sem.leave()); // always free the slot
        });
    })
);
// Await all publishes (at most 1000 outstanding at any time)
await Promise.all(tasks);
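If you'd rather not add the semaphore dependency, the same outstanding-publish limit can be hand-rolled with a small promise-based limiter (the same idea as the p-limit package); a sketch, assuming the same dataBufs and topic as above:

// Minimal promise-based concurrency limiter
function makeLimiter(max) {
    let active = 0;
    const waiting = [];
    const next = () => {
        if (active >= max || waiting.length === 0) return;
        active++;
        const { fn, resolve, reject } = waiting.shift();
        fn().then(resolve, reject).finally(() => { active--; next(); });
    };
    return (fn) => new Promise((resolve, reject) => {
        waiting.push({ fn, resolve, reject });
        next();
    });
}

const limit = makeLimiter(1000);
await Promise.all(dataBufs.map((d) => limit(() => topic.publish(d))));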

Dispatching up to max parallel REST calls in node.js / how does await work in node

I'm using node.js, have a graph of dependent REST calls, and am trying to dispatch them in parallel. It's part of a testing/load testing script.
My graph has "connected components", and each component is directed and acyclic. I toposort each component, so I end up with a graph that looks like this:
Component1 = [Call1, Call2, ..., Callm] (Call2 possibly dependent on Call1, etc.)
Component2 = [Call1, Call2, ..., Calln]
...
Componentp
The number of components, and the number of calls in each component (m, n, and p), are dynamic.
I want to round robin over the components and each of their calls, dispatching up to "x" calls concurrently.
Whilst I understand a little about Promises, async/await, and Node's event loop, I'm NOT an expert.
PSEUDO CODE ONLY
maxParallel = x
runningCallCount = 0
while (components.some(calls => calls.some(call => noResponseYet(call)))) {
    if (runningCallCount < maxParallel) {
        runningCallCount++
        var result = await axios(call)
        runningCallCount--
    }
}
This doesn't work - I never dispatch the calls.
Remove the await and I fall through to the runningCallCount-- straight away.
Other approaches I've tried, and comments:
Wrapping every call in an async function and using Promise.all on a chunk of x at a time - a chunking style of approach. This may work, but it doesn't achieve the goal of always having x parallel calls in flight.
Used RxJS - tried merge on all components with a max level of parallelism - but this parallelises the components, not the calls within the components, and I couldn't work out how to make it do what I wanted from the sparse documentation. I'd used the .NET version before, so this was a bit disappointing.
I haven't yet tried recursion.
Can anyone chime in with an idea of how to do this?
How does await work in node? I've seen it explained in terms of generator functions and yield statements (https://medium.com/siliconwat/how-javascript-async-await-works-3cab4b7d21da).
Can anyone add detail - what happens with the event loop when code hits an await? I'm guessing either the entire stack unrolls, or a call to run the event loop is somehow inserted by the await.
I'm not interested in using a load testing package or other load testing tools - I just want to understand the best way to do this, and also what's going on in node with await.
I'll update this if I understand it or find a solution. Help appreciated.
I would think something like this would work to achieve always having n parallel calls going.
const delay = time => new Promise(r => setTimeout(r, time));

let maxJobs = 4;
let jobQueue = [
    {time:1000},{time:3000},{time:1000},{time:2000},
    {time:1000},{time:1000},{time:2000},{time:1000},
    {time:1000},{time:5000},{time:1000},{time:1000},
    {time:1000},{time:7000},{time:1000},{time:1000}
];
jobQueue.forEach((e, i) => e.id = i);

// Each "worker" keeps pulling the next job off the shared queue until it is empty.
const jobProcessor = async function() {
    while (jobQueue.length > 0) {
        let job = jobQueue.pop();
        console.log('Starting id', job.id);
        await delay(job.time);
        console.log('Finished id', job.id);
    }
    return;
};

(async () => {
    console.log("Starting", new Date());
    // start maxJobs workers; together they keep maxJobs jobs in flight at all times
    await Promise.all([...Array(maxJobs).keys()].map(e => jobProcessor()));
    console.log("Finished", new Date());
})();
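To map this back to the question: swap the delay for the real axios call, and treat the toposorted components as the job queue. Running each component's calls strictly in order satisfies its internal dependencies, and x workers keep up to x calls in flight across components. A rough sketch (runComponents and the shape of the request configs are assumptions, not from the question):

const axios = require('axios');

// components: array of arrays of axios request configs, each toposorted
async function runComponents(components, maxParallel) {
    const queue = [...components];
    const worker = async () => {
        while (queue.length > 0) {
            const calls = queue.shift();
            // within a component, run sequentially so dependent calls
            // only fire after their prerequisites have responded
            for (const call of calls) {
                const result = await axios(call);
                // stash result somewhere if later calls need it
            }
        }
    };
    // maxParallel workers => up to maxParallel calls in flight
    await Promise.all([...Array(maxParallel).keys()].map(() => worker()));
}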

Express Node Request For Loop Issue [duplicate]

With node.js I want to http.get a number of remote urls in a way that only 10 (or n) runs at a time.
I also want to retry a request if an exception occurs locally (m times), but when the status code indicates an error (5XX, 4XX, etc.) the request counts as valid.
This is really hard for me to wrap my head around.
Problems:
Cannot try-catch http.get as it is async.
Need a way to retry a request on failure.
I need some kind of semaphore that keeps track of the currently active request count.
When all requests finished I want to get the list of all request urls and response status codes in a list which I want to sort/group/manipulate, so I need to wait for all requests to finish.
It seems like promises are recommended for every async problem, but I end up nesting too many of them and it quickly becomes indecipherable.
There are lots of ways to approach the 10 requests running at a time.
Async Library - Use the async library with the .parallelLimit() method where you can specify the number of requests you want running at one time.
Bluebird Promise Library - Use the Bluebird promise library and the request library to wrap your http.get() into something that can return a promise and then use Promise.map() with a concurrency option set to 10.
Manually coded - Code your requests manually to start up 10 and then each time one completes, start another one.
In all cases, you will have to manually write some retry code and as with all retry code, you will have to very carefully decide which types of errors you retry, how soon you retry them, how much you backoff between retry attempts and when you eventually give up (all things you have not specified).
Other related answers:
How to make millions of parallel http requests from nodejs app?
Million requests, 10 at a time - manually coded example
My preferred method is with Bluebird and promises. Including retry and result collection in order, that could look something like this:
const request = require('request');
const Promise = require('bluebird');
const get = Promise.promisify(request.get);

let remoteUrls = [...]; // large array of URLs
const maxRetryCnt = 3;
const retryDelay = 500;

Promise.map(remoteUrls, function(url) {
    let retryCnt = 0;
    function run() {
        return get(url).then(function(result) {
            // do whatever you want with the result here
            return result;
        }).catch(function(err) {
            // decide what your retry strategy is here
            // catch all errors here so other URLs continue to execute
            // isRetryableError() is your own predicate (e.g. network errors only)
            if (isRetryableError(err) && retryCnt < maxRetryCnt) {
                ++retryCnt;
                // try again after a short delay
                // chain onto previous promise so Promise.map() is still
                // respecting our concurrency value
                return Promise.delay(retryDelay).then(run);
            }
            // make value be null if no retries succeeded
            return null;
        });
    }
    return run();
}, {concurrency: 10}).then(function(allResults) {
    // everything done here and allResults contains results with null for err URLs
});
The simple way is to use the async library; its .parallelLimit() method does exactly what you need.
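A minimal sketch of that (the urls array is a stand-in for your real list; per the question, HTTP error statuses count as valid results, and local errors are reported instead of aborting the batch; retry logic is omitted for brevity):

const async = require('async');
const https = require('https');

// hypothetical list of URLs
const urls = ['https://example.com/a', 'https://example.com/b' /* ... */];

const tasks = urls.map((url) => (callback) => {
    https.get(url, (res) => {
        res.resume(); // drain the body; we only care about the status here
        callback(null, { url, status: res.statusCode });
    }).on('error', (err) => {
        // a local exception; report it as a result so the batch continues
        callback(null, { url, error: err.message });
    });
});

// at most 10 requests in flight at a time; results keep input order
async.parallelLimit(tasks, 10, (err, results) => {
    if (err) return console.error(err);
    console.log(results); // list of urls with status codes, ready to sort/group
});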

Node.js Synchronous Library Code Blocking Async Execution

Suppose you've got a 3rd-party library that's got a synchronous API. Naturally, attempting to use it in an async fashion yields undesirable results in the sense that you get blocked when trying to do multiple things in "parallel".
Are there any common patterns that allow us to use such libraries in an async fashion?
Consider the following example (using the async library from NPM for brevity):
var async = require('async');

function ts() {
    return new Date().getTime();
}

var startTs = ts();

process.on('exit', function() {
    console.log('Total Time: ~' + (ts() - startTs) + ' ms');
});

// This is a dummy function that simulates some 3rd-party synchronous code.
function vendorSyncCode() {
    var future = ts() + 50; // ~50 ms in the future.
    while(ts() <= future) {} // Spin to simulate blocking work.
}

// My code that handles the workload and uses `vendorSyncCode`.
function myTaskRunner(task, callback) {
    // Do async stuff with `task`...
    vendorSyncCode(task);
    // Do more async stuff...
    callback();
}

// Dummy workload.
var work = (function() {
    var result = [];
    for (var i = 0; i < 100; ++i) result.push(i);
    return result;
})();

// Problem:
// -------
// The following two calls will take roughly the same amount of time to complete.
// In this case, ~6 seconds each.
async.each(work, myTaskRunner, function(err) {});
async.eachLimit(work, 10, myTaskRunner, function(err) {});

// Desired:
// --------
// The latter call with 10 "workers" should complete roughly an order of magnitude
// faster than the former.
Are fork/join or spawning worker processes manually my only options?
Yes, those are your only options.
If you need 50ms of cpu time to do something, and you need to do it 10 times, then you'll need 500ms of cpu time in total. If you want it done in less than 500ms of wall clock time, you need to use more cpus. That means multiple node instances (or a C++ addon that pushes the work out onto the thread pool). How to get multiple instances depends on your app structure: a child process that you feed the work to using child_process.send() is one way, running multiple servers with cluster is another. Breaking up your server is yet another. Say it's an image store application that is mostly fast to process requests, except that converting an image into another format is cpu intensive. You could push the image processing portion into a different app and access it through a REST API, leaving the main app server responsive.
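A minimal sketch of the child-process variant (the file names and message shapes here are made up for illustration):

// parent.js - farm the cpu-heavy work out to a child process
const { fork } = require('child_process');

const worker = fork('./worker.js'); // start several of these for a pool
worker.on('message', (msg) => {
    console.log('task', msg.id, 'done in child');
});
worker.send({ id: 1, payload: 'some work' });

// worker.js - receives tasks and runs the blocking code off the main process
// (assume the synchronous vendor library is require()d here)
process.on('message', (task) => {
    vendorSyncCode(task.payload); // the blocking call no longer stalls the parent
    process.send({ id: task.id });
});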
If you aren't concerned that it takes 50ms of cpu to do the request, but instead you are concerned that you can't interleave handling of other requests with the processing of the cpu intensive request, then you could break the work up into small chunks, and schedule the next chunk with setInterval(). That's usually a horrid hack, though. Better to restructure the app.
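For completeness, the chunking idea could look like the sketch below, using setImmediate to yield back to the event loop between slices (the same idea as the setInterval approach mentioned above):

// Process items in small synchronous slices, yielding between slices so
// handling of other requests can be interleaved.
function processInChunks(items, chunkSize, processItem, done) {
    let i = 0;
    function runChunk() {
        const end = Math.min(i + chunkSize, items.length);
        for (; i < end; i++) {
            processItem(items[i]); // synchronous, cpu-bound work
        }
        if (i < items.length) {
            setImmediate(runChunk); // let the event loop breathe, then continue
        } else {
            done();
        }
    }
    runChunk();
}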
