Handling a large array of items in Node.js

I have an array of more than 100 items. Each item hits a URL, and after getting the response from that item's URL I insert the result into the database.
If I use async/await, I have to wait until all 100+ items have been processed. But if I fire the requests asynchronously, the function returns immediately, and because of that asynchronous behaviour the data will not be visible in the database until all of the background work has completed.
Example 1: using async await
const loc = async (locations) => {
  try {
    for (const location of locations) {
      // wait for each request and its insert before moving to the next item
      const response = await html(location.url);
      await insertDB(response);
    }
  } catch (e) {
    console.log('error : ', e);
  }
  return 'All operations completed successfully';
};
loc(locations);
locations is the array of items. With this approach I only get the response once all of the requests have completed.
Example 2: asynchronously
const loc = (locations) => {
  return new Promise((resolve, reject) => {
    locations.forEach(location => {
      // fire-and-forget: nothing waits for these callbacks to finish
      html(location.url, (res) => {
        insertDB(res, (rows) => {
        });
      });
    });
    resolve('All operations completed successfully');
  });
};
loc(locations);
Here I get the response immediately, but the requests and inserts keep running in the background.
Considering example 1, is it possible to divide the looped requests across child processes so they execute faster in Node.js, or is there another way to do it?

Is this what you want? It works like your 2nd solution, but it only resolves after every insert has completed.
const loc = async (locations) => {
  const promises = locations.map(async (location) => {
    try {
      const response = await html(location.url);
      await insertDB(response);
    } catch (e) {
      console.log('error : ', e);
    }
  });
  // wait until every request/insert pair has finished before resolving
  await Promise.all(promises);
  return 'All operations completed successfully';
};
There should be a difference. Could you do me a favour and test the performance, with solution1() as your first function and solution2() as mine?
console.time('solution1');
await solution1(locations);
console.timeEnd('solution1');
console.time('solution2');
await solution2(locations);
console.timeEnd('solution2');
Assume insertDB costs 100ms ~ 200ms, avg = 150ms.
Solution 1:
run iteration 1, wait for insertDB -> run iteration 2, wait for insertDB
time = 150ms + 150ms = 300ms
Solution 2:
run iteration 1 without waiting for insertDB -> run iteration 2 without waiting for insertDB
resolve once every insertDB has completed
time = 150ms
because all the insertDB calls start without waiting for one another
credit: https://stackoverflow.com/a/56993392/13703967
test: https://codepen.io/hjian0329/pen/oNxdPoZ
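If firing all 100+ requests at once turns out to be too aggressive, a middle ground between the two solutions is to process the array in fixed-size batches. This is a minimal sketch, reusing the html and insertDB helpers from the question; the batch size of 10 is an arbitrary assumption:
const locBatched = async (locations, batchSize = 10) => {
  for (let i = 0; i < locations.length; i += batchSize) {
    const batch = locations.slice(i, i + batchSize);
    // each batch runs in parallel; batches themselves run one after another
    await Promise.all(batch.map(async (location) => {
      try {
        const response = await html(location.url);
        await insertDB(response);
      } catch (e) {
        console.log('error : ', e);
      }
    }));
  }
  return 'All operations completed successfully';
};
As for the child-process idea from the question: it generally will not help here, because the work is I/O-bound (HTTP and database calls) rather than CPU-bound, so concurrency within a single process is usually enough.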

Related

Nodejs `fs.createReadStream` as promise

I'm trying to get fs.createReadStream working as a promise, so that it resolves after the entire file has been read.
In the case below, I'm pausing the stream, executing the awaitable method and resuming.
1. How do I make .on('end', ...) execute only at the actual end?
2. If 1. is not possible, why is .on('close', ...) never fired? Maybe I can use it to resolve the promise.
function parseFile<T>(filePath: string, row: (x: T) => void, err: (x) => void, end: (x) => void) {
return new Promise((resolve, reject) => {
const stream = fs.createReadStream(filePath);
stream.on('data', async data => {
try {
stream.pause();
await row(data);
} finally {
stream.resume();
}
})
.on('end', (rowCount: number) => {
resolve();// NOT REALLY THE END row(data) is still being called after this
})
.on('close', () => {
resolve();// NEVER BEING CALLED
})
.on('error', (rowCount: number) => {
reject();// NEVER GETS HERE, AS EXPECTED
})
})
}
UPDATE
Here you can actually test it: https://stackblitz.com/edit/node-czktjh?file=index.js
run node index.js
The output should be 1000 and not 1
Thanks
Something to be aware of: you've removed the line processing from the current version of the question, so the stream is being read in large chunks. It appears to be reading the entire file in just two chunks, and thus firing just two data events, so the expected count here is 2, not 1000.
I think the problem with this code occurs because stream.pause() does not pause the generation of the end event - it only pauses future data events. If the last data event has fired and you then await inside the processing of that data event (which causes your data event handler to immediately return a promise), the stream will think it's done and the end event will fire before you're done awaiting inside that last data event. Remember, the data event handler is NOT promise-aware, and it appears that stream.pause() only affects data events, not the end event.
I can imagine a work-around with a flag that keeps track of whether you're still processing a data event and postpones handling the end event until you're done with that last data event. The code below illustrates how to use the flag.
FYI, the missing close event is another stream weirdness. Your nodejs program actually terminates before the close event gets to fire. If you put this at the start of your program:
setTimeout(() => { console.log('done with timer');}, 5000);
Then, you will see the close event because the timer will prevent your nodejs program from exiting before the close event gets to fire. I'm not suggesting this as a solution to any problem, just to illustrate that the close event is still there and wants to fire if your program doesn't exit before it gets a chance.
Here's code that demonstrates the use of flags to work around the pause issue. When you run this code, you will only see 2 data events, not 1000, because this code is not reading lines; it's reading much larger chunks than that. So the expected result of this is not 1000.
// run `node index.js` in the terminal
const fs = require('fs');
const parseFile = row => {
let paused = false; // becomes true only while a data chunk is being processed
let ended = false;
let dataCntr = 0;
return new Promise((resolve, reject) => {
const stream = fs.createReadStream('./generated.data.csv');
stream
.on('data', async data => {
++dataCntr;
try {
stream.pause();
paused = true;
await row(data);
} finally {
paused = false;
stream.resume();
if (ended) {
console.log(`received ${dataCntr} data events`);
resolve();
}
}
})
.on('end', rowCount => {
ended = true;
if (!paused) {
console.log(`received ${dataCntr} data events`);
resolve();
}
})
.on('close', () => {
//resolve();
})
.on('error', err => {
reject(err);
});
});
};
(async () => {
let count = 0;
await parseFile(async row => {
await new Promise(resolve => setTimeout(resolve, 50)); //sleep
count++;
});
console.log(`lines executed: ${count}, the expected is more than 1`);
})();
FYI, I still think your original version of the question had the problem I mentioned in my first comment - that you weren't pausing the right stream. What is documented here is yet another problem (where you can get end before your await in the last data event is done).
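As an aside (not from the original answers): readable streams in Node.js are async iterable, so a pattern like the following avoids the pause/resume and flag bookkeeping entirely. A minimal sketch, reading chunk by chunk and awaiting the work for each chunk before pulling the next one:
const fs = require('fs');

async function parseFile(filePath, row) {
  const stream = fs.createReadStream(filePath);
  let dataCntr = 0;
  // for await...of pulls one chunk at a time and does not ask the stream
  // for the next chunk until the loop body has finished
  for await (const chunk of stream) {
    ++dataCntr;
    await row(chunk);
  }
  return dataCntr;
}

(async () => {
  const chunks = await parseFile('./generated.data.csv', async chunk => {
    await new Promise(resolve => setTimeout(resolve, 50)); // simulate async work
  });
  console.log(`received ${chunks} data events`);
})();
The same caveat applies: this iterates raw chunks, not lines, so the count reflects the chunk size, not the number of lines.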

Trigger the execution of a function if any condition is met

I'm writing an HTTP API with expressjs in Node.js and here is what I'm trying to achieve:
I have a regular task that I would like to run regularly, approx every minute. This task is implemented with an async function named task.
In reaction to a call in my API I would like to have that task called immediately as well
Two executions of the task function must not be concurrent. Each execution should run to completion before another execution is started.
The code looks like this:
// only a single execution of this function is allowed at a time
// which is not the case with the current code
async function task(reason: string) {
console.log("do thing because %s...", reason);
await sleep(1000);
console.log("done");
}
// call task regularly
setIntervalAsync(async () => {
await task("ticker");
}, 5000) // normally 1min
// call task immediately
app.get("/task", async (req, res) => {
await task("trigger");
res.send("ok");
});
I've put a full working sample project at https://github.com/piec/question.js
If I were in go I would do it like this and it would be easy, but I don't know how to do that with Node.js.
Ideas I have considered or tried:
I could apparently put task in a critical section using a mutex from the async-mutex library. But I'm not too fond of adding mutexes in js code.
Many people seem to be using message queue libraries with worker processes (bee-queue, bullmq, ...) but this adds a dependency to an external service like redis usually. Also if I'm correct the code would be a bit more complex because I need a main entrypoint and an entrypoint for worker processes. Also you can't share objects with the workers as easily as in a "normal" single process situation.
I have tried RxJs subject in order to make a producer consumer channel. But I was not able to limit the execution of task to one at a time (task is async).
Thank you!
You can make your own serialized asynchronous queue and run the tasks through that.
This queue uses a flag to keep track of whether it's in the middle of running an asynchronous operation already. If so, it just adds the task to the queue and will run it when the current operation is done. If not, it runs it now. Adding it to the queue returns a promise so the caller can know when the task finally got to run.
If the tasks are asynchronous, they are required to return a promise that is linked to the asynchronous activity. You can mix in non-asynchronous tasks too and they will also be serialized.
class SerializedAsyncQueue {
constructor() {
this.tasks = [];
this.inProcess = false;
}
// adds a promise-returning function and its args to the queue
// returns a promise that resolves when the function finally gets to run
add(fn, ...args) {
let d = new Deferred();
this.tasks.push({ fn, args, deferred: d });
this.check();
return d.promise;
}
check() {
if (!this.inProcess && this.tasks.length) {
// run next task
this.inProcess = true;
const nextTask = this.tasks.shift();
Promise.resolve(nextTask.fn(...nextTask.args)).then(val => {
this.inProcess = false;
nextTask.deferred.resolve(val);
this.check();
}).catch(err => {
console.log(err);
this.inProcess = false;
nextTask.deferred.reject(err);
this.check();
});
}
}
}
const Deferred = function() {
if (!(this instanceof Deferred)) {
return new Deferred();
}
const p = this.promise = new Promise((resolve, reject) => {
this.resolve = resolve;
this.reject = reject;
});
this.then = p.then.bind(p);
this.catch = p.catch.bind(p);
if (p.finally) {
this.finally = p.finally.bind(p);
}
}
let queue = new SerializedAsyncQueue();
// utility function
const sleep = function(t) {
return new Promise(resolve => {
setTimeout(resolve, t);
});
}
// only a single execution of this function is allowed at a time
// so it is run only via the queue that makes sure it is serialized
async function task(reason: string) {
async function runIt() {
console.log("do thing because %s...", reason);
await sleep(1000);
console.log("done");
}
return queue.add(runIt);
}
// call task regularly
setIntervalAsync(async () => {
await task("ticker");
}, 5000) // normally 1min
// call task immediately
app.get("/task", async (req, res) => {
await task("trigger");
res.send("ok");
});
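For comparison (not part of the answer above), the async-mutex idea mentioned in the question could be sketched roughly like this, assuming the library's Mutex and runExclusive API and reusing the sleep helper from above:
import { Mutex } from "async-mutex";

const taskMutex = new Mutex();

// runExclusive queues callers, so only one body runs at a time;
// each caller still gets a promise that resolves when its run finishes
async function task(reason: string) {
  await taskMutex.runExclusive(async () => {
    console.log("do thing because %s...", reason);
    await sleep(1000);
    console.log("done");
  });
}
The setIntervalAsync ticker and the Express handler can then call task() exactly as in the question.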
Here's a version using RxJS#Subject that is almost working. How to finish it depends on your use-case.
async function task(reason: string) {
console.log("do thing because %s...", reason);
await sleep(1000);
console.log("done");
}
const run = new Subject<string>();
const effect$ = run.pipe(
// Limit one task at a time
concatMap(task),
share()
);
const effectSub = effect$.subscribe();
interval(5000).subscribe(_ =>
run.next("ticker")
);
// call task immediately
app.get("/task", async (req, res) => {
effect$.pipe(
take(1)
).subscribe(_ =>
res.send("ok")
);
run.next("trigger");
});
The issue here is that res.send("ok") is linked to the effect$ stream's next emission. This may not be the one generated by the run.next you're about to call.
There are many ways to fix this. For example, you can tag each emission with an ID and then wait for the corresponding emission before using res.send("ok").
There are better ways too if calls distinguish themselves naturally.
A Clunky ID Version
Generating an ID randomly is a bad idea, but it gets the general thrust across. You can generate unique IDs however you like. They can be integrated directly into the task somehow or can be kept 100% separate the way they are here (task itself has no knowledge that it's been assigned an ID before being run).
interface IdTask {
taskId: number,
reason: string
}
interface IdResponse {
taskId: number,
response: any
}
async function task(reason: string) {
console.log("do thing because %s...", reason);
await sleep(1000);
console.log("done");
}
const run = new Subject<IdTask>();
const effect$: Observable<IdResponse> = run.pipe(
// concatMap only allows one observable at a time to run
concatMap((eTask: IdTask) => from(task(eTask.reason)).pipe(
map((response:any) => ({
taskId: eTask.taskId,
response
})as IdResponse)
)),
share()
);
const effectSub = effect$.subscribe({
next: v => console.log("This is a shared task emission: ", v)
});
interval(5000).subscribe(num =>
run.next({
taskId: num,
reason: "ticker"
})
);
// call task immediately
app.get("/task", async (req, res) => {
const randomId = Math.random();
effect$.pipe(
filter(({taskId}) => taskId == randomId),
take(1)
).subscribe(_ =>
res.send("ok")
);
run.next({
taskId: randomId,
reason: "trigger"
});
});

Run node.js functions in parallel

I am new to Node.js and I want some functions to run simultaneously.
I have seen several articles and, as far as I understood, I can use Promise.all and Promise.allSettled.
I don't understand why my functions run sequentially; here's the code I arranged.
async processPropositions(proposition) {
const dataWords = [];
const start = Date.now()
//async functions that return promises
const invariableData = this.invariables(propositionSplitted);
const nounsData = this.nouns(propositionSplitted);
const adjectivesData = this.adjectives(propositionSplitted);
const verbsData = this.verbs(propositionSplitted);
//here the code should stop until every promise is resolved
const [invariables, nouns, adjectives, verbs] = await Promise.all([invariableData, nounsData, adjectivesData, verbsData]);
//I find this time == to the sum of the times printed in the single functions above
const finish = Date.now()
const time = finish - start
console.log(time)
//here in the original function I append results to dataWords
return dataWords;
}
I have printed the time of the single async functions (i.e. this.invariables(propositionSplitted);, this.nouns(propositionSplitted);, this.adjectives(propositionSplitted);, this.verbs(propositionSplitted);) and their sum is equal to the time I'm printing with this function.
I have the same problem when I try to run the main function for every proposition of the array propositions with a for loop. Since they're independent I'd like to run them simultaneously and then collect the results when every promise is resolved.
I tried this but obviously I'm missing a fundamental concept of asynchronous coding:
for (let proposition of propositions) {
results.push(this.processPropositions(proposition));
for (let result of results) {
dataWords.push(await result);
if (propositions.length > 1 && results.indexOf(result) < conjunctions.length) dataWords.push(Conjunctions.getConjunction(conjunctions[results.indexOf(result)]));
}
}
If I don't await, the for loop finishes before it receives the promise results, while if I keep the await it becomes sequential again.
The inner loop isn't waiting for the promises to resolve. Load up all the promises like you are doing in the outer loop, but don't check for resolutions until after they are fulfilled. Then you can loop through the results.
function myPromise(duration) {
return new Promise((resolve) => {
setTimeout(() => {
resolve(duration);
}, duration);
});
}
const propositions = [ 200, 500, 4000 ];
const promises = [];
for(const proposition of propositions) {
promises.push(myPromise(proposition));
}
const start = Date.now();
Promise.all(promises).then(results => {
const totalTime = Date.now() - start;
console.log(`all timers finished in ${totalTime}`);
results.forEach((t, i) => {
console.log(` timer ${i} took ${t}`);
});
});
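Applied to the question's own loop, the same idea would look roughly like this (a sketch reusing dataWords, conjunctions and Conjunctions.getConjunction from the question, and assuming it runs inside an async method):
// start every proposition immediately, then wait for all of them at once
const results = await Promise.all(
  propositions.map(proposition => this.processPropositions(proposition))
);
results.forEach((result, i) => {
  dataWords.push(result);
  if (propositions.length > 1 && i < conjunctions.length) {
    dataWords.push(Conjunctions.getConjunction(conjunctions[i]));
  }
});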

Chunking axios.get requests with a 1 second delay per chunk - presently getting 429 error

I have a script using axios that hits an API with a limit of 5 requests per second. At present my request array length is 72 and will grow over time. I receive an abundance of 429 errors. The responses per endpoint change with each run of the script; ex: url1 on iteration1 returns 429, then url1 on iteration2 returns 200, url1 on iteration3 returns 200, url1 on iteration4 returns 429.
Admittedly my understanding of async/await and promises are spotty at best.
What I understand:
I can have multiple axios.get running because of async. The variable I set in my main that uses the async function can include the await to ensure all requests have processed before continuing the script.
Promise.all can run multiple axios.gets but, if a single request fails the chain breaks and no more requests will run.
Because the API will only accept 5 requests per second I have to chunk my axios.get requests to 5 endpoints, wait for those to finish processing before sending the next chunk of 5.
setTimeout will assign a time limit to a single request; once the time is up the request is done and will not be sent again, even if the return was something other than 200.
setInterval will assign a time limit but will send the request again after the time is up, and keep requesting until it receives a 200.
async function main() {
var endpoints = makeEndpoints(boards, whiteList); //returns an array of string API endpoints ['www.url1.com', 'www.url2.com', ...]
var events = await getData(endpoints);
...
}
The getData() has seen many iterations in an attempt to correct the 429s. Here are a few:
// will return the 200's sometimes and not others, I believe it's the timeout but that won't attempt to hit a failed url again (as I understand it)
async function getData(endpoints) {
let events = [];
for (let x = 0; x < endpoints.length; x++) {
try {
let response = await axios.get(endpoints[x], {timeout: 2000});
if ( response.status == 200 &&
response.data.hasOwnProperty('_embedded') &&
response.data._embedded.hasOwnProperty('events')
) {
let eventsArr = response.data._embedded.events;
eventsArr.forEach(event => {
events.push(event)
});
}
} catch (error) {
console.log(error);
}
}
return events;
}
// returns a great many 429 errors via the setInterval, as I understand this function sets a delay of N seconds before attempting the next call
async function getData(endpoints) {
let data = [];
let promises = [];
endpoints.forEach((url) => {
promises.push(
axios.get(url)
)
})
setInterval(function() {
for (let i = 0; i < promises.length; i += 5) {
let requestArr = promises.slice(i, i + 5);
axios.all(requestArr)
.then(axios.spread((...res) => {
console.log(res);
}))
.catch(err => {
console.log(err);
})
}
}, 2000)
}
// Here I hoped Promise.all would allow each request to do its thing and return the data, but after further reading I found that if a single request fails the rest will fail in the Promise.all
async function getData(endpoints) {
  try {
    const res = await Promise.all(endpoints.map(url => axios.get(url))).catch(err => {});
    return res;
  } catch {
    throw Error("Promise failed");
  }
}
// Returns so many 429 and only 3/4 data I know to expect
async function getData(endpoints) {
const someFunction = () => {
return new Promise(resolve => {
setTimeout(() => resolve('222'), 100)
})
}
const requestArr = endpoints.map(async data => {
let waitForThisData = await someFunction(data);
return axios.get(data)
.then(response => { console.log(response.data)})
.catch(error => console.log(error.toString()))
});
Promise.all(requestArr).then(() => {
console.log('resolved promise.all')
})
}
// Seems to get close to solving it, but once an error is hit, Promise.all stops processing the endpoints
async function getData(endpoints) {
(async () => {
try {
const allResponses = await Promise.all(
endpoints.map(url => axios.get(url).then(res => console.log(res.data)))
);
console.log(allResponses[0]);
} catch(e) {
console.log(e);
// handle errors
}
})();
}
It seems like I have so many relevant pieces but I cannot connect them in an efficient and working model. Perhaps axios has something completely unknown to me? I've also tried using Bluebird's concurrency option to limit the requests to 5 per attempt, but that still returned 429s from axios.
I've been staring at this for days and with so much new information swirling in my head I'm at a loss as to how to send 5 requests per second, await the response, then send another set of 5 requests to the API.
Guidance/links/ways to improve upon the question would be much appreciated.
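A minimal sketch of the chunk-and-delay approach described in the question (send 5 requests, wait for all of them, pause roughly a second, then send the next 5) could look like this. It reuses the _embedded.events response shape from the first attempt above, and uses Promise.allSettled so a single 429 doesn't discard the rest of the chunk:
const axios = require('axios');

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function getData(endpoints, chunkSize = 5) {
  const events = [];
  for (let i = 0; i < endpoints.length; i += chunkSize) {
    const chunk = endpoints.slice(i, i + chunkSize);
    // allSettled waits for every request in the chunk, fulfilled or rejected
    const results = await Promise.allSettled(chunk.map(url => axios.get(url)));
    for (const result of results) {
      if (result.status === 'fulfilled' &&
          result.value.data._embedded &&
          result.value.data._embedded.events) {
        events.push(...result.value.data._embedded.events);
      }
    }
    await sleep(1000); // stay under the 5-requests-per-second limit
  }
  return events;
}
Failed URLs could be collected from the rejected results and retried in a later chunk if needed.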

How to fix MongoError: Cannot use a session that has ended

I'm trying to read data from a MongoDB Atlas collection using Node.js. When I try to read the contents of my collection I get the error MongoError: Cannot use a session that has ended. Here is my code
client.connect(err => {
const collection = client
.db("sample_airbnb")
.collection("listingsAndReviews");
const test = collection.find({}).toArray((err, result) => {
if (err) throw err;
});
client.close();
});
I'm able to query for a specific document, but I'm not sure how to return all documents of a collection. I've searched for this error, I can't find much on it. Thanks
In your code, it doesn't wait for the find() to complete its execution and goes on to the client.close() statement. So by the time it tries to read data from the db, the connection has already ended. I faced this same problem and solved it like this:
// connect to your cluster
const client = await MongoClient.connect('yourMongoURL', {
useNewUrlParser: true,
useUnifiedTopology: true,
});
// specify the DB's name
const db = client.db('nameOfYourDB');
// execute find query
const items = await db.collection('items').find({}).toArray();
console.log(items);
// close connection
client.close();
EDIT: this whole thing should be in an async function.
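For example, as a minimal sketch, the snippet above wrapped in an async IIFE (keeping the placeholder connection string and DB name):
(async () => {
  const client = await MongoClient.connect('yourMongoURL', {
    useNewUrlParser: true,
    useUnifiedTopology: true,
  });
  const items = await client.db('nameOfYourDB').collection('items').find({}).toArray();
  console.log(items);
  await client.close();
})();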
Ran into the same issue when I updated the MongoClient from 3.3.2 to the latest version (3.5.2 as of this writing). Either pin version 3.3.2 by changing package.json to "mongodb": "3.3.2", or just use an async/await wrapper.
If the issue still persists, remove node_modules and install again.
One option is to use a Promise chain. collection.find({}).toArray() can either receive a callback function or return a promise, so you can chain calls with .then().
collection.find({}).toArray() // returns the 1st promise
.then( items => {
console.log('All items', items);
return collection.find({ name: /^S/ }).toArray(); //return another promise
})
.then( items => {
console.log("All items with field 'name' beginning with 'S'", items);
client.close(); // Last promise in the chain closes the database
});
Of course, this daisy chaining makes the code effectively sequential. This is useful when the next call in the chain relates to the previous one, like getting a user id in the first one, then looking up the user's details in the next.
Several unrelated queries should be executed in parallel (async) and when all the results are back, dispose of the database connection.
You could do this by tracking each call in an array or counter, for example.
const totalQueries = 3;
let completedQueries = 0;
collection.find({}).toArray()
.then( items => {
console.log('All items', items);
dispose(); // Increments the counter and closes the connection if total reached
})
collection.find({ name: /^S/ }).toArray()
.then( items => {
console.log("All items with field 'name' beginning with 'S'", items);
dispose(); // Increments the counter and closes the connection if total reached
});
collection.find({ age: 55 }).toArray()
.then( items => {
console.log("All items with field 'age' with value '55'", items);
dispose(); // Increments the counter and closes the connection if total reached
});
function dispose(){
if (++completedQueries >= totalQueries){
client.close();
}
}
You have 3 queries. As each one invokes dispose() the counter increments. When they've all invoked dispose(), the last one will also close the connection.
Async/Await should make it even easier, because they unwrap the Promise result from the then function.
async function test(){
const allItems = await collection.find({}).toArray();
const namesBeginningWithS = await collection.find({ name: /^S/ }).toArray();
const fiftyFiveYearOlds = await collection.find({ age: 55 }).toArray();
client.close();
}
test();
Below is an example of how Async/Await can end up making async code behave sequentially and run inefficiently by waiting for one async function to complete before invoking the next one, when the ideal scenario is to invoke them all immediately and only wait until they all are complete.
let counter = 0;
function doSomethingAsync(id, start) {
return new Promise(resolve => {
setTimeout(() => {
counter++;
const stop = new Date();
const runningTime = getSeconds(start, stop);
resolve(`result${id} completed in ${runningTime} seconds`);
}, 2000);
});
}
function getSeconds(start, stop) {
return (stop - start) / 1000;
}
async function test() {
console.log('Awaiting 3 Async calls');
console.log(`Counter before execution: ${counter}`);
const start = new Date();
let callStart = new Date();
const result1 = await doSomethingAsync(1, callStart);
callStart = new Date();
const result2 = await doSomethingAsync(2, callStart);
callStart = new Date();
const result3 = await doSomethingAsync(3, callStart);
const stop = new Date();
console.log(result1, result2, result3);
console.log(`Counter after all ran: ${counter}`);
console.log(`Total time to run: ${getSeconds(start, stop)}`);
}
test();
Note: Awaiting like in the example above makes the calls sequential again. If each takes 2 seconds to run, the function will take 6 seconds to complete.
Combining the best of all worlds, you would want to use Async/Await while running all calls immediately. Fortunately, Promise has a method to do this, so test() can be written like this:
async function test(){
let [allItems, namesBeginningWithS, fiftyFiveYearOlds] = await Promise.all([
collection.find({}).toArray(),
collection.find({ name: /^S/ }).toArray(),
collection.find({ age: 55 }).toArray()
]);
client.close();
}
Here's a working example to demonstrate the difference in performance:
let counter = 0;
function doSomethingAsync(id, start) {
return new Promise(resolve => {
setTimeout(() => {
counter++;
const stop = new Date();
const runningTime = getSeconds(start, stop);
resolve(`result${id} completed in ${runningTime} seconds`);
}, 2000);
});
}
function getSeconds(start, stop) {
return (stop - start) / 1000;
}
async function test() {
console.log('Awaiting 3 Async calls');
console.log(`Counter before execution: ${counter}`);
const start = new Date();
const [result1, result2, result3] = await Promise.all([
doSomethingAsync(1, new Date()),
doSomethingAsync(2, new Date()),
doSomethingAsync(3, new Date())
]);
const stop = new Date();
console.log(result1, result2, result3);
console.log(`Counter after all ran: ${counter}`);
console.log(`Total time to run: ${getSeconds(start, stop)}`);
}
test();
Other people have touched on this, but I just want to highlight that .toArray() executes asynchronously, so you need to make sure it has finished before closing the session.
This won't work:
const randomUser = await db.collection('user').aggregate([ { $sample: { size: 1 } } ]);
console.log(randomUser.toArray());
await client.close();
This will:
const randomUser = await db.collection('user').aggregate([ { $sample: { size: 1 } } ]).toArray();
console.log(randomUser);
await client.close();
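Another variant of the original callback-style snippet simply moves client.close() into the toArray callback, so the connection is closed only after the query has finished: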
client.connect(err => {
const collection = client
.db("sample_airbnb")
.collection("listingsAndReviews");
const test = collection.find({}).toArray((err, result) => {
if (err) throw err;
client.close();
});
});
