I have information about 1000 files in a MongoDB collection. I write a query to fetch the 1000 records and, in a loop, call a function to download each file to the local system, so downloading all 1000 files is a sequential process.
I want some parallelism in the downloading process. In the loop, I want to download 10 files at a time, meaning I call the download function 10 times, and after those 10 downloads complete I download the next 10 files (calling the download function 10 more times).
How can I achieve this parallelism, or is there a better way to do this?
I looked at the Kue npm package, but how would I achieve this with it? By the way, I am downloading from FTP, so I am using the basic-ftp npm package for the FTP operations.
The async library is very powerful for this, and quite easy too once you understand the basics.
I'd suggest using eachLimit so your app doesn't have to loop through explicit batches of ten; it will simply keep ten files downloading at the same time.
var files = ['a.txt', 'b.txt']
var concurrency = 10;
async.eachLimit(files, concurrency, downloadFile, onFinish);
function downloadFile(file, callback){
// run your download code here
// when file has downloaded, call callback(null)
// if there is an error, call callback('error code')
}
function onFinish(err){
if(err) {
// do something with the error
}
// reaching this point means the files have all downloaded
}
The async library will run downloadFile in parallel (up to the concurrency limit), passing each invocation an entry from the files list; once every item in the list has completed, it calls onFinish.
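Since you mentioned basic-ftp, here is a minimal sketch of what downloadFile might look like with it; the host, credentials and paths are placeholders, and opening one connection per file is just the simplest way to let ten transfers run at once:

const ftp = require("basic-ftp");

function downloadFile(file, callback) {
  const client = new ftp.Client();
  client.access({ host: "ftp.example.com", user: "user", password: "pass" })
    .then(() => client.downloadTo(`./downloads/${file}`, `/remote/path/${file}`))
    .then(() => callback(null))     // tell eachLimit this file is done
    .catch((err) => callback(err))  // tell eachLimit this file failed
    .finally(() => client.close());
}

A connection pool would be more efficient than connecting once per file, but this keeps the example self-contained.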
Without seeing your implementation I can only provide a generic answer.
Let's say that your download function receives one fileId and returns a promise that resolves when said file has finished downloading. For this POC, I will mock that up with a promise that will resolve to the file name after 200 to 500 ms.
function download(fileindex) {
return new Promise((resolve,reject)=>{
setTimeout(()=>{
resolve(`file_${fileindex}`);
},200+300*Math.random());
});
}
You have 1000 files and want to download them in 100 iterations of 10 files each.
Let's encapsulate things. I'll declare a generator that receives a bucket index and a size, and yields the ids [bucket*size ... bucket*size + size - 1]:
function* range(bucket, size=10) {
let start = bucket*size,
end=start+size;
for (let i = start; i < end; i++) {
yield i;
}
}
You should create 100 "buckets" containing a reference to 10 files each.
let buckets = [...range(0,100)].map(bucket=>{
return [...range(bucket,10)];
});
At this point, the contents of buckets are:
[
[file0 ... file9]
...
[file 990 ... file 999]
]
Then, iterate over your buckets with a plain for...of loop (await works inside any async function).
On each iteration, use Promise.all to start the 10 calls to download and wait for them all to finish.
async function proceed() {
  for (let bucket of buckets) {
    // start the 10 downloads of this bucket and wait until they all finish
    await Promise.all(bucket.map(download));
  }
}
Let's see a running example (just 10 buckets, we're all busy here :D):
function download(fileindex) {
return new Promise((resolve, reject) => {
let file = `file_${fileindex}`;
setTimeout(() => {
resolve(file);
}, 200 + 300 * Math.random());
});
}
function* range(bucket, size = 10) {
let start = bucket * size,
end = start + size;
for (let i = start; i < end; i++) {
yield i;
}
}
let buckets = [...range(0, 10)].map(bucket => {
return [...range(bucket, 10)];
});
async function proceed() {
  let bucketNumber = 0,
    timeStart = performance.now();
  for (let bucket of buckets) {
    let startingTime = Number((performance.now() - timeStart) / 1000).toFixed(1).substr(-5);
    console.log(
      `${startingTime}s downloading bucket ${bucketNumber}`
    );
    let result = await Promise.all(bucket.map(download));
    let endingTime = Number((performance.now() - timeStart) / 1000).toFixed(1).substr(-5);
    console.log(
      `${endingTime}s bucket ${bucketNumber++} complete:`,
      `[${result[0]} ... ${result.pop()}]`
    );
  }
}
document.querySelector('#proceed').addEventListener('click',proceed);
<button id="proceed" >Proceed</button>
I'm trying to understand which setup is the best for doing the following operations:
Read line by line a CSV file
Use the row data as input of a complex function that at the end outputs a file (one file for each row)
When the entire process is finished I need to zip all the files generated during step 2
My goal: a fast and scalable solution able to handle huge files.
I've implemented step 2 using two approaches and I'd like to know which is the best and why (or whether there are other, better ways).
Step 1
This step is simple: I rely on the CSV parser's async iterator API:
const fs = require("fs");
const { parse } = require("csv-parse"); // the CSV parser with the async iterator API

async function* loadCsvFile(filepath, params = {}) {
try {
const parameters = {
...csvParametersDefault,
...params,
};
const inputStream = fs.createReadStream(filepath);
const csvParser = parse(parameters);
const parser = inputStream.pipe(csvParser)
for await (const line of parser) {
yield line;
}
} catch (err) {
throw new Error("error while reading csv file: " + err.message);
}
}
Step 2
Option 1
Await the long operation handleCsvLine for each line:
// step 1
const csvIterator = loadCsvFile(filePath, options);
// step 2
let counter = 0;
for await (const row of csvIterator) {
await handleCsvLine(row);
counter++;
if (counter % 50 === 0) {
logger.debug(`Processed label ${counter}`);
}
}
// step 3
zipFolder(folderPath);
Pro
nice to see the files being generated one after the other
since it waits for each operation to end, I can show the progress nicely
Cons
it waits for each operation; could it be faster?
Option 2
Push the promise returned by the long operation handleCsvLine into an array, and after the loop do Promise.all:
// step 1
const csvIterator = loadCsvFile(filePath, options);
// step 2
let counter = 0;
const promises = [];
for await (const row of csvIterator) {
promises.push(handleCsvLine(row));
counter++;
if (counter % 50 === 0) {
logger.debug(`Processed label ${counter}`);
}
}
await Promise.all(promises);
// step 3
zipFolder(folderPath);
Pro
I do not wait inside the loop, so it should be faster, shouldn't it?
Cons
since it does not wait, the for loop is very fast, but then there is a long wait at the end (i.e., a bad progress experience)
Step 3
A simple step in which I use the archiver library to create a zip of the folder in which I saved the files from step 2:
const fs = require("fs");
const path = require("path");
const archiver = require("archiver");

function zipFolder(folderPath, globPath, outputFolder, outputName, logger) {
return new Promise((resolve, reject) => {
// create a file to stream archive data to.
const stream = fs.createWriteStream(path.join(outputFolder, outputName));
const archive = archiver("zip", {
zlib: { level: 9 }, // Sets the compression level.
});
archive.glob(globPath, { cwd: folderPath });
// good practice to catch warnings (ie stat failures and other non-blocking errors)
archive.on("warning", function (err) {
if (err.code === "ENOENT") {
logger.warning(err);
} else {
logger.error(err);
reject(err);
}
});
// good practice to catch this error explicitly
archive.on("error", function (err) {
logger.error(err);
reject(err);
});
// pipe archive data to the file
archive.pipe(stream);
// listen for all archive data to be written
// 'close' event is fired only when a file descriptor is involved
stream.on("close", function () {
resolve();
});
archive.finalize();
});
}
Not using await does not make the operations themselves faster. Without await you simply don't wait for the response before moving on to the next iteration; the work still ends up queued on the same event loop either way.
To get real parallelism you should use child_process instead. Node.js runs your JavaScript on a single thread, but with child_process you can spawn one worker per CPU core, so you can generate multiple files at a time based on the number of CPU cores available in the system.
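A hedged sketch of that child_process idea (the file names, handleCsvLine and the message protocol are all assumptions, and back-pressure is glossed over): the main process forks one worker per CPU core and deals the CSV rows out round-robin.

// main.js
const { fork } = require("child_process");
const os = require("os");

async function processCsvInParallel(csvIterator) {
  const workers = os.cpus().map(() => fork("./csv-worker.js"));
  let i = 0;
  for await (const row of csvIterator) {
    workers[i % workers.length].send(row); // hand this row to the next worker
    i++;
  }
  // tell every worker there is no more work, then wait for them all to exit
  await Promise.all(workers.map((w) => new Promise((resolve) => {
    w.on("exit", resolve);
    w.send("done");
  })));
}

// csv-worker.js
// (requires or defines handleCsvLine, the same per-row function as in option 1/2)
let pending = 0;
let finished = false;

process.on("message", async (msg) => {
  if (msg === "done") { finished = true; maybeExit(); return; }
  pending++;
  await handleCsvLine(msg);
  pending--;
  maybeExit();
});

function maybeExit() {
  if (finished && pending === 0) process.exit(0); // exit only when all rows are written
}

Once all the workers have exited, the main process can call zipFolder as in step 3.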
I have a collection of teams containing around 80 000 documents. Every Monday I would like to reset the scores of every team using Firebase Cloud Functions. This is my function:
exports.resetOrgScore = functions.runWith(runtimeOpts).pubsub.schedule("every monday 00:00").timeZone("Europe/Oslo").onRun(async (context) => {
let batch = admin.firestore().batch();
let count = 0;
let overallCount = 0;
const orgDocs = await admin.firestore().collection("teams").get();
orgDocs.forEach(async(doc) => {
batch.update(doc.ref, {score:0.0});
if (++count >= 500 || ++overallCount >= orgDocs.docs.length) {
await batch.commit();
batch = admin.firestore().batch();
count = 0;
}
});
});
I tried running the function on a smaller collection of 10 documents and it works fine, but when running it on the "teams" collection it returns "Cannot modify a WriteBatch that has been committed". I tried returning the promise like this (code below), but that doesn't fix the problem. Thanks in advance :)
return await batch.commit().then(function () {
batch = admin.firestore().batch();
count = 0;
return null;
});
There are three problems in your code:
You use async/await with forEach(), which is not recommended: the callback passed to forEach() is not awaited; see more explanations here or here.
As the error tells you, you "Cannot modify a WriteBatch that has been committed", and that is exactly what you are doing with await batch.commit(); batch = admin.firestore().batch(); inside the forEach callback.
Just as important, you don't return the promise returned by the asynchronous methods. See here for more details.
You'll find in the doc (see the Node.js tab) code that deletes all the docs of a collection by recursively using a batch. It's easy to adapt it to update the docs instead, as follows. Note that we use a dateUpdated flag to select the docs for each new batch: with the original code the docs were deleted, so there was no need for a flag.
const runtimeOpts = {
timeoutSeconds: 540,
memory: '1GB',
};
exports.resetOrgScore = functions
.runWith(runtimeOpts)
.pubsub
.schedule("every monday 00:00")
.timeZone("Europe/Oslo")
.onRun((context) => {
return new Promise((resolve, reject) => {
deleteQueryBatch(resolve).catch(reject);
});
});
async function deleteQueryBatch(resolve) {
const db = admin.firestore();
const snapshot = await db
.collection('teams')
.where('dateUpdated', '==', "20210302")
.orderBy('__name__')
.limit(499)
.get();
const batchSize = snapshot.size;
if (batchSize === 0) {
// When there are no documents left, we are done
resolve();
return;
}
// Update documents in a batch
const batch = db.batch();
snapshot.docs.forEach((doc) => {
batch.update(doc.ref, { score:0.0, dateUpdated: "20210303" });
});
await batch.commit();
// Recurse on the next process tick, to avoid
// exploding the stack.
process.nextTick(() => {
deleteQueryBatch(resolve);
});
}
Note that the above Cloud Function is configured with the maximum value for the timeout, i.e. 9 minutes.
If it turns out that all your docs cannot be updated within 9 minutes, you will need another approach, for example using the Admin SDK from one of your servers, or cutting the work into pieces and running the Cloud Function several times.
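For the Admin SDK route, a minimal sketch (not a drop-in solution) could reuse the same recursive function from a plain Node.js script; initializing the SDK with application default credentials is an assumption about your setup:

const admin = require("firebase-admin");
admin.initializeApp(); // assumes GOOGLE_APPLICATION_CREDENTIALS (or similar) is configured

// deleteQueryBatch is the same recursive update function shown above
new Promise((resolve, reject) => {
  deleteQueryBatch(resolve).catch(reject);
}).then(() => console.log("all team scores reset"));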
My issues
Make 1000+ calls to an online API that limits the number of API calls to 10 calls/sec.
Wait for all the API calls to give back a result (or retry); it can take 5 seconds before the API sends its data.
Use the combined data in the rest of my app
What I have tried while looking at a lot of different questions and answers here on the site
Use promise to wait for one API request
const https = require("https");
function myRequest(param) {
  const options = {
    host: "api.xxx.io",
    port: 443,
    path: "/custom/path/" + param,
    method: "GET"
  };
  return new Promise(function(resolve, reject) {
    const req = https.request(options, function(result) {
      let str = "";
      result.on('data', function(chunk) { str += chunk; });
      result.on('end', function() { resolve(JSON.parse(str)); });
    });
    // listen for errors on the request itself and reject, so callers can catch/retry
    req.on('error', function(err) { reject(err); });
    req.end();
  });
}
Use Promise.all to do all the requests and wait for them to finish
const params = [{item: "param0"}, ... , {item: "param1000+"}]; // imagine 1000+ items
const promises = [];
params.map(function(param) {
  promises.push(myRequest(param.item));
});
const result = Promise.all(promises).then(function(data) {
  // doing some funky stuff with data
});
So far so good, sort of
It works when I limit the number of API requests to a maximum of 10, because beyond that the API's rate limiter kicks in. When I console.log(promises), it gives back an array of 'request'.
I have tried to add setTimeout in different places, like:
...
params.map(function(param) {
  promises.push(setTimeout(function() {
    myRequest(param.item);
  }, 100));
});
...
But that does not seem to work. When I console.log(promises), it gives back an array of 'function'
My questions
Now I am stuck ... any ideas?
How do I build in retries for when the API gives an error?
Thank you for reading up to here, you are already a hero in my book!
When you have a complicated control flow, using async/await helps a lot to clarify the logic of the flow.
Let's start with the following simple algorithm to limit everything to 10 requests per second:
make 10 requests
wait 1 second
repeat until no more requests
For this the following simple implementation will work:
async function rateLimitedRequests(params) {
  let results = [];
  while (params.length > 0) {
    let batch = [];
    for (let i = 0; i < 10; i++) {
      // use shift() instead of pop() if you want to process in the original order
      let thisParam = params.pop();
      if (thisParam) {
        batch.push(myRequest(thisParam.item));
      }
    }
    results = results.concat(await Promise.all(batch));
    await delayOneSecond();
  }
  return results;
}
Now we just need to implement the one second delay. We can simply promisify setTimeout for this:
function delayOneSecond() {
return new Promise(ok => setTimeout(ok, 1000));
}
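With that in place, usage could look like the following sketch, reusing the params array and myRequest from the question (those names are assumed from the code above):

rateLimitedRequests(params).then(function (data) {
  // all 1000+ responses are in, combined into one array
  // do your funky stuff with data here
});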
This will definitely give you a rate limiter of just 10 requests each second. In fact it performs somewhat slower than that, because each batch executes in request time + one second. This is perfectly fine and already meets your original intent, but we can improve it to squeeze in a few more requests and get as close as possible to exactly 10 requests per second.
We can try the following algorithm:
remember the start time
make 10 requests
compare end time with start time
delay one second minus request time
repeat until no more requests
Again, we can use almost exactly the same logic as the simple code above but just tweak it to do time calculations:
const ONE_SECOND = 1000;
async function rateLimitedRequests(params) {
  let results = [];
  while (params.length > 0) {
    let batch = [];
    let startTime = Date.now();
    for (let i = 0; i < 10; i++) {
      let thisParam = params.pop();
      if (thisParam) {
        batch.push(myRequest(thisParam.item));
      }
    }
    results = results.concat(await Promise.all(batch));
    let endTime = Date.now();
    let requestTime = endTime - startTime;
    let delayTime = ONE_SECOND - requestTime;
    if (delayTime > 0) {
      await delay(delayTime);
    }
  }
  return results;
}
Now, instead of hardcoding the one second delay function, we can write one that accepts a delay period:
function delay(milliseconds) {
return new Promise(ok => setTimeout(ok, milliseconds));
}
We have here a simple, easy to understand function that will rate limit as close as possible to 10 requests per second. It is rather bursty in that it makes 10 parallel requests at the beginning of each one second period but it works. We can of course keep implementing more complicated algorithms to smooth out the request pattern etc. but I leave that to your creativity and as homework for the reader.
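You also asked about retries. One simple approach (a sketch, not tied to any particular library) is to wrap myRequest in a helper that retries a failed call a few times before giving up, reusing the delay function above; the attempt count and back-off values are arbitrary assumptions:

async function withRetries(param, attempts = 3, backoff = 1000) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await myRequest(param);
    } catch (err) {
      if (attempt === attempts) throw err; // out of retries: propagate the error
      await delay(backoff * attempt);      // simple linear back-off before trying again
    }
  }
}

Inside rateLimitedRequests you would then push withRetries(thisParam.item) instead of myRequest(thisParam.item). This assumes myRequest rejects its promise when a request fails, as in the version shown earlier.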
My application makes about 50 redis.get calls to serve a single HTTP request; it serves millions of requests daily and runs on about 30 pods.
When monitoring with New Relic I see an average redis.get time of 200 ms. To optimize this, I wrote a simple pipelining system in Node.js which is simply a wrapper over redis.get: it pushes all the requests into a queue and then executes the queue using redis.mget (fetching all the keys in bulk).
Following is the code snippet:
class RedisBulk {
constructor() {
this.queue = [];
this.processingQueue = {};
this.intervalId = setInterval(() => {
this._processQueue();
}, 5);
}
clear() {
clearInterval(this.intervalId);
}
get(key, cb) {
this.queue.push({cb, key});
}
_processQueue() {
if (this.queue.length > 0) {
let queueLength = this.queue.length;
logger.debug('Processing Queue of length', queueLength);
let time = (new Date).getTime();
this.processingQueue[time] = this.queue;
this.queue = []; //empty the queue
let keys = [];
this.processingQueue[time].forEach((item)=> {
keys.push(item.key);
});
global.redisClient.mget(keys, (err, replies)=> {
if (err) {
captureException(err);
console.error(err);
} else {
this.processingQueue[time].forEach((item, index)=> {
item.cb(err, replies[index]);
});
}
delete this.processingQueue[time];
});
}
}
}
let redis_bulk = new RedisBulk();
redis_bulk.get('a', (err, value) => { /* use value */ });
redis_bulk.get('b', (err, value) => { /* use value */ });
redis_bulk.get('c', (err, value) => { /* use value */ });
redis_bulk.get('d', (err, value) => { /* use value */ });
My question is: is this a good approach? Will it help to optimize the Redis get time? Is there any other solution for the above problem?
Thanks
I'm not a Redis expert but, judging by the documentation:
MGET has the time complexity of
O(N) where N is the number of keys to retrieve.
And GET has the time complexity of
O(1)
That puts both approaches at the same end result in terms of time complexity in your scenario. A bulk request with MGET can bring some improvement on the I/O side, but apart from that it looks like you have the same bottleneck.
Ideally I'd split the data into chunks and respond via multiple HTTP requests in an async fashion, if that's an option.
Alternatively, you can try calling GET with promise.all() to run GET requests in parallel, for all the GET calls you need.
Something like;
const asyncRedis = require("async-redis");
const client = asyncRedis.createClient();
function bulk() {
  const keys = []; // the keys you need for this request
  return Promise.all(keys.map((key) => client.get(key)));
}
So I have created a server which collects data and writes it into the DB in a never-ending loop.
server.listen(3001, () => {
doFullScan();
});
async function doFullScan() {
while (true) {
await collectAllData();
}
}
collectAllData() is a method which checks for available projects, loops through each project, collects some data and writes it into the DB.
async function collectAllData() {
  // doing something
  const projectNames = ['array with projects name'];
  // this loop takes too much time
  for (let project of projectNames) {
    await collectProjectData(project);
  }
  // doing something
}
The problem is that the whole loop takes too much time, so I would like to speed it up by parallelizing the loop and using all of my computer's cores.
How should I do it?
There is the cluster library, with examples at https://nodejs.org/docs/latest/api/cluster.html, but I don't want to create new servers. I want to spawn children which do a task and exit after they have done their job.
So there is const { fork } = require('child_process'); but I'm not exactly sure how to make each fork run only the collectProjectData() method.
You can do it natively without any third party libraries.
Right now, your for loop runs each call one after the other.
Option 1
Use Promise.all and .map
await Promise.all(projectNames.map(async (projectName) => {
  await collectProjectData(projectName);
}));
Note, if you use .map it will kick off all of them at the same time, which might be too much if projectNames continues to grow.
This is the complete opposite of what yours is doing currently.
Option 2
There is a middle way...running batches in sequence, but items inside each batch asynchronously.
const chunk = (a, l) => a.length === 0 ? [] : [a.slice(0, l)].concat(chunk(a.slice(l), l));
const batchSize = 10;
const projectNames = ['array with projects name'];
let projectNamesInChunks = chunk(projectNames, batchSize);
for (let chunk of projectNamesInChunks) {
  await Promise.all(chunk.map(async (projectName) => {
    await collectProjectData(projectName);
  }));
}
I recommend using bluebird's Promise.map:
http://bluebirdjs.com/docs/api/promise.map.html
That way you can control the level of concurrency as you wish, like this:
await Promise.map(projectNames, collectProjectData, {concurrency: 3})
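For completeness, a minimal sketch of what that looks like with bluebird pulled in (assuming the bluebird package is installed, and that projectNames and collectProjectData are the ones from the question):

const Bluebird = require("bluebird");

async function collectAllData() {
  const projectNames = ['array with projects name'];
  // at most 3 collectProjectData calls are in flight at any time
  await Bluebird.map(projectNames, collectProjectData, { concurrency: 3 });
}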