I have a NodeJS / background-process issue, that I don't know how to solve it 'elegant', straight, the right way.
The user submits some (like ~10 or more) URLs via a textarea and then they should be processed asynchronous. [a screenshot with puppeteer has to be taken, some information gathered, the screenshot should be processed with sharp and the result should be persisted in a MongoDB. The screenshot via GridFS and the URL in an own collection with a reference to the screenshot].
While this async process is calculated in the background, the page should be updated whenever a URL got processed.
There are so many ways to do that, but which one is the most correct/straightforward/resource saving way?
Browserify and I do it in the browser? No, too much stuff on the client side.. AJAX/Axios posts and wait for the URLs to be processed and reflect the results on the side? Trigger the process before the response gets send back to the client or let the client start the processing?
So, I made a workflow engine of some sort that supports long-running jobs. And I followed this tutorial https://farazdagi.com/2014/rest-and-long-running-jobs/
Which is nothing, when a request is created you just return a status code and when the jobs are completed you just log them somewhere and use that.
For, this I used EventEmitter which is used inside a promise. It's only my solution maybe not elegant, maybe outright wrong. Made a little POC for you.
const events = require('events')
const emitter = new events.EventEmitter();
const actualWork = function() {
return new Promise((res,rej)=>{
setTimeout(res, 1000);
})
}
emitter.on('workCompleted', function(){
// log somewhere
});
app.get('/someroute', (req,res)=>{
res.json({msg:'reqest initiated', id: 'some_id'})
actualWork()
.then(()=>{
emitter.emit('workCompleted', {id: 'some_id'});
});
})
app.get('/someroute/id/status', (req,res)=>{
//get the log
})
Related
In Heroku long requests can cause H12 timeout errors.
The request must then be processed...by your application...within 30 seconds to
avoid the timeout.
src
Heroku suggests moving long tasks to background jobs.
Sending an email...Accessing a remote API...
Web scraping / crawling...you should move this heavy lifting into a background job which can run asynchronously from your web request.
src
Heroku's docs say requests shouldn't take longer than 500ms to return a response.
It’s important for a web application to serve end-user requests as
fast as possible. A good rule of thumb is to avoid web requests which
run longer than 500ms. If you find that your app has requests that
take one, two, or more seconds to complete, then you should consider
using a background job instead.
src
So if I have a background job, how do I tell the frontend when the background job is done and what the job returns?
On Heroku their example code just returns the background job id. But this won't give the frontend the information it needs.
app.post('/job', async (req, res) => {
let job = await workQueue.add();
res.json({ id: job.id });
});
For example this method won't tell the frontend when an image is done being uploaded. Or the frontend won't know when a call to an API, like an external exchange rate API, returns a result, like an exchange rate, and what that result is.
Someone suggested using job.finished() but doesn't this get you back where you started? Now your requests are waiting for the queue to finish in order to respond. So your requests are the a same length as when there was no queue and this could lead to timeout errors again.
const result = await job.finished();
res.send(result);
This is example uses Bull, Redis, Node.js.
Someone suggested websockets. I didn't find an example of this yet.
The idea of using a queue for long tasks is that you post the task and
then return immediately. I guess you are updating the database as last
step in your job, and only use the completed event for notifying the
clients. What you need to do in this case is to implement either a
websocket or similar realtime communication and push the notification
to relevant clients. This can become complicated so you can save some
time with a solution like https://pusher.com/ or similar...
https://github.com/OptimalBits/bull/issues/1901
I also saw a solution in heroku's full example, which I didn't originally see:
web server
// Fetch updates for each job
async function updateJobs() {
for (let id of Object.keys(jobs)) {
let res = await fetch(`/job/${id}`);
let result = await res.json();
if (!!jobs[id]) {
jobs[id] = result;
}
render();
}
}
frontend
// Fetch updates for each job
async function updateJobs() {
for (let id of Object.keys(jobs)) {
let res = await fetch(`/job/${id}`);
let result = await res.json();
if (!!jobs[id]) {
jobs[id] = result;
}
render();
}
}
// Attach click handlers and kick off background processes
window.onload = function() {
document.querySelector("#add-job").addEventListener("click", addJob);
document.querySelector("#clear").addEventListener("click", clear);
setInterval(updateJobs, 200);
};
I'm using #google-cloud/logging to log some stuff out of my express app over on Cloud Run.
Something like this:
routeHandler.ts
import { Logging } from "#google-cloud/logging";
const logging = new Logging({ projectId: process.env.PROJECT_ID });
const logName = LOG_NAME;
const log = logging.log(logName);
const resource = {
type: "cloud_run_revision",
labels: { ... }
};
export const routeHandler: RequestHandler = (req,res,next) => {
try {
// EXAMPLE: LOG A WARNING
const metadata = { resource, severity: "WARNING" };
const entry = log.entry(metadata,"SOME WARNING MSG");
await log.write(entry);
return res.sendStatus(200);
}
catch(err) {
// EXAMPLE: LOG AN ERROR
const metadata = { resource, severity: "ERROR" };
const entry = log.entry(metadata,"SOME ERROR MSG");
await log.write(entry);
return res.sendStatus(500);
}
};
You can see that the log.write(entry) is asynchronous. So, in theory, it would be recommended to await for it. But here is what the documentation from #google-cloud/logging says:
Doc link
And I got no problem with that. In my real case, even if the log.write() fails, it is inside a try-catch and any errors will be handled just fine.
My problem is that it kind of conflicts with the Cloud Run documentation:
Doc link
Note: If I don't wait for the log.write() call, I'll end the request cycle by responding to the request
And Cloud Run does behave like that. A couple weeks back, I tried to respond immediately to the request and fire some long background job. And the process kind of halted for a while, and I think it restarted once it got another request. Completely unpredictable. And when I ran this test I'm mentioning here, I even had a MIN_INSTANCE=1 set on my cloud run service container. Even that didn't allow my background job to run smoothly. Therefore, I don't think it's fine to leave the process doing background stuff when I've finished handling a request (by doing the "fire and forget" approach).
So, what should I do here?
Posting this answer as a Community Wiki based on #Karl-JorhanSjögren's correct assumption in the comments.
For Log calls on apps running in Cloud Run you are indeed encouraged to take a Fire and Forget approach, since you don't really need to force synchronicity on that.
As mentioned in the comments replying to your concern on the CPU being disabled after the request is fulfilled, the CPU will be throttled first so that the instance can be brought back up quickly and completely disabled after a longer period of inactivity. So firing of small logging calls that in most cases will finish within milliseconds shouldn't be a problem.
What is mentioned in the documentation is targeted at processes that run for Longer periods of time.
How to prevent new requests before sending the response to the last request. on On the other hand just process one request at the same time.
app.get('/get', function (req, res) {
//Stop enter new request
someAsyncFunction(function(result){
res.send(result);
//New Request can enter now
}
}
Even tho I agree with jfriend00 that this might not be the optimal way to do this, if you see that it's the way to go, I would just use some kind of state management to check if it's allowed to access that /get request and return a different response if it's not.
You can use your database to do this. I strongly recommend using Redis for this because it's in-memory and really quick. So it's super convenient. You can use mongodb or mysql if you prefer so, but Redis would be the best. This is how it would look, abstractly -
Let's say you have an entry in your database called isLoading, and it's set to false by default.
app.get('/get', function (req, res) {
//get isloading from your state management of choice and check it's value
if(isLoading == true) {
// If the app is loading, notify the client that he should wait
// You can check for the status code in your client and react accordingly
return res.status(226).json({message: "I'm currently being used, hold on"})
}
// Code below executes if isLoading is not true
//Set your isLoading DB variable to true, and proceed to do what you have
isLoading = true
someAsyncFunction(function(result){
// Only after this is done, isLoading is set to false and someAsyncFunction can be ran again
isLoading = false
return res.send(result)
}
}
Hope this helps
Uhhhh, servers are designed to handle multiple requests from multiple users so while one request is being processed with asynchronous operations, other requests can be processed. Without that, they don't scale beyond a few users. That is the design of any server framework for node.js, including Express.
So, whatever problem you're actually trying to solve, that is NOT how you should solve it.
If you have some sort of concurrency issue that is pushing you to ask for this, then please share the ACTUAL concurrency problem you need to solve because it's much better to solve it a different way than to handicap your server into one request at a time.
I am using this (contentful-export) library in my express app like so
const app = require('express');
...
app.get('/export', (req, rex, next) => {
const contentfulExport = require('contentful-export');
const options = {
...
}
contentfulExport(options).then((result) => {
res.send(result);
});
})
now this does work, but the method takes a bit of time and sends status / progress messages to the node console, but I would like to keep the user updated also.. is there a way I can send the node console progress messages to the client??
This is my first time using node / express any help would be appreciated, I'm not sure if this already has an answer since im not entirely sure what to call it?
Looking of the documentation for contentful-export I don't think this is possible. The way this usually works in Node is that you have an object (contentfulExport in this case), you call a method on this object and the same object is also an EventEmitter. This way you'd get a hook to react to fired events.
// pseudo code
someLibrary.on('someEvent', (event) => { /* do something */ })
someLibrary.doLongRunningTask()
.then(/* ... */)
This is not documented for contentful-export so I assume that there is no way to hook into the log messages that are sent to the console.
Your question has another tricky angle though. In the code you shared you include a single endpoint (/export). If you would like to display updates or show some progress you'd probably need a second endpoint giving information about the progress of your long running task (which you can not access with contentful-export though).
The way this is usually handled is that you kick of a long running task via a certain HTTP endpoint and then use another endpoint that serves infos via polling or or a web socket connection.
Sorry that I can't give a proper solution but due to the limitation of contentful-export I don't think there is a clean/easy way to show progress of the exported data.
Hope that helps. :)
I have a mongodb and NodeJS setup on expressJS. What this API basically does is storing e-mail adresses and other information about users.
These are called personas and are stored in a MongoDB database. What I'm trying to do now is calling a url in my app, which sends all personas to the Mailchimp API.
However, as the amount of personas that are stored is quite high (144.000), I can not send them in one batch to the Mailchimp API. What I'm trying to do is send them in batches, without much luck.
How would I go about to set this up? Currently I'm using the Async package to limit the simultaneous sends to the Mailchimp API. But I'm not sure if this is the correct way to go.
I guess the code below is not working, as the personas-array I collect is too big to fit in the memory. But I'm not sure how to chunk it up in a correct way.
//This is a model function which searches the database to collect all personas
Persona.getAllSubscriptions(function(err, personas) {
//Loop send each persona to mailchimp
var i = 1;
//This is the async module I'm using to limit the simultaneous requests to Mailchimp
async.forEachLimit(personas, 10, function (persona, callback) {
//This is the function to send one item to mailchimp
mailchimpHelper.sendToMailchimp(persona, mailchimpMergefields, function(err,body){
if(err) {
callback(err);
} else if(!body) {
callback(new Error("No response from Mailchimp"));
} else {
console.log(i);
i++;
callback();
}
});
}, function(err) {
if (err) console.log(err);
//Set a success message
res.json({error: false, message: "All personas updated"});
});
});
I ran into a similar problem with a query to a collection that could return more than 170,000 documents. I ended up using the "stream" API to build batches to be processed. You could do something similar to "build" batches to send to MailChimp.
Here's an example.
var stream = db.collection.find().stream(); //be sure find is returning a cursor
var batch = []
this.stream.on('data', function(data){
batch.push(data);
if(batch.length >= maxBatchSize){
stream.pause();
// send batch to mail chimp
}
});
this.stream.on('pause', function(){
// send batch to mailChimp
// when mailChimp has finished
stream.resume();
});
this.stream.on('end', ()=>{
// data finished
});
You can look at the documentation for cursor and stream here
Hope this helps.
Cheers.
There seem to be some things that I wouldn't do so like you described. You are trying to quite heavy processing inside the node server. The trigger by url could cause you a lot of problems if you do not secure it.
Also, this is a heavy process which is better to be implemented as queue-worker approach separated from the server. This would give you more control over the process, some of the email sendings might fail or error might occur on the mailchimp side(API is down etc). So instead of triggering directly sending, just trigger worker and process emails as chunks as #jackfrster described.
Make sure you have checked the Mailchimp API limits. Do you have considered alternatives like creating campaign and send out the campaign so you would not need to sending for each person in list ?