Looking for some advice on how to do the following:
Receive a request from the website for a certain long-running process (~10-30 seconds)
Website backend schedules a job and puts it onto a distributed queue (could be SQS/Kue/Resque)
A worker takes the job off the queue and processes it. Stores the result somewhere.
Website backend subscribes to the job-complete event and gets the result of the processed job.
Website backend closes request to website with result of the task.
Steps 1, 2 and 3 are fine. I am just finding it tricky to pass the result of a queued task back to the backend so that it can close the request.
Polling from the website isn't an option - the request has to stay open for however long the task takes to be processed. I'm using Node.js.
Steps 2-4 are all happening on the server side. There is nothing stopping you from polling the expected result location (on the server side) and returning the result when it finally appears (see the sketch after the steps below).
Client sends requests
Server starts job and begins polling for the result
The result comes back so the poll loop on the server side ends
Server sends result back to client
The client-server connection is finally severed
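Here is a minimal sketch of that server-side poll loop, assuming Express; getJobResult() and enqueueJob() are hypothetical stand-ins for however you store results and push work onto SQS/Kue/Resque:

const express = require('express');
const app = express();
app.use(express.json());

// Hypothetical helpers: enqueueJob() pushes work onto your queue and returns a job id;
// getJobResult() returns the stored result, or null while the job is still running.

function waitForResult(jobId, interval = 1000, timeout = 30000) {
  const deadline = Date.now() + timeout;
  return new Promise((resolve, reject) => {
    (function poll() {
      getJobResult(jobId).then((result) => {
        if (result !== null) return resolve(result);
        if (Date.now() > deadline) return reject(new Error('timed out waiting for job'));
        setTimeout(poll, interval); // check again shortly
      }).catch(reject);
    })();
  });
}

app.post('/process', async (req, res) => {
  const jobId = await enqueueJob(req.body);     // step 2: job goes onto the queue
  try {
    const result = await waitForResult(jobId);  // steps 3-4: poll until the worker stores a result
    res.json(result);                           // step 5: close the original request with the result
  } catch (err) {
    res.status(504).send(err.message);
  }
});

app.listen(3000);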
You could get even more efficient code going if the job can call a URL when it finishes. In this case your service would have two endpoints... one for the client to start the process, and another that your job queue can call. (A fuller sketch follows below.)
Client sends requests
Server starts job... saves the response callback in a global object so that it is not closed (I'm assuming something like express here)
openJobs.push({ id: 12345, res: res });
jobQueue.execute({ id: 12345, data: {...}});
When the job finishes and saves the result, call the service url with the id
You can check that the job has actually finished and remove the job from the openJobs list
Finish the original response
openJob.res.send(data);
This will send the data and close the original client-server connection.
The overall result is that you have no polling at all... which is cool.
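Pulled together, a sketch of that two-endpoint version might look like the following (assuming Express; the /jobs/:id/complete route name and the in-memory openJobs map are just illustrative):

const express = require('express');
const app = express();
app.use(express.json());

const openJobs = new Map();                     // job id -> the still-open Express response

// Endpoint the client hits to start the long-running process.
app.post('/process', (req, res) => {
  const id = String(Date.now());                // stand-in for a real job id
  openJobs.set(id, res);                        // keep the response open; don't end it yet
  jobQueue.execute({ id, data: req.body });     // your queue, as in the steps above
});

// Endpoint the job queue calls when the work is finished.
app.post('/jobs/:id/complete', (req, res) => {
  const pending = openJobs.get(req.params.id);  // check the job is actually one we're waiting on
  if (pending) {
    openJobs.delete(req.params.id);             // remove it from the open jobs list
    pending.json(req.body);                     // finish the original response to the client
  }
  res.sendStatus(200);                          // acknowledge the callback from the worker
});

app.listen(3000);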
Of course... In either of these scenarios you are screwed if your server shuts down in the middle of a batch... This is why I would recommend something like socket.io in this scenario. You would queue the results of jobs somewhere and socket.io would poll/wait for callbacks on the list and push to the client when there are new items. This is better because if the server crashes no biggie - the client will re-connect once the server comes back up.
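A rough socket.io sketch of that idea, where whatever watches the results list calls a hypothetical onJobFinished() to push completed jobs to the right client:

const http = require('http');
const express = require('express');
const { Server } = require('socket.io');

const app = express();
const server = http.createServer(app);
const io = new Server(server);

io.on('connection', (socket) => {
  // The client says which job it is waiting on (it can re-send this after a reconnect).
  socket.on('watch-job', (jobId) => socket.join('job:' + jobId));
});

// Hypothetical hook, called by whatever polls/waits on the results list.
function onJobFinished(jobId, result) {
  io.to('job:' + jobId).emit('job-complete', result); // push the result to the waiting client
}

server.listen(3000);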
I have a web application on an IIS server.
I have a POST method that takes a long time to run (around 30-40 minutes).
After a period of time the application stops running (without any exception).
I set the Idle Time-out to 0 and it did not help.
What can I do to solve it?
Instead of doing all the work initiated by the request before responding at all:
Receive the request
Put the information in the request in a queue (which you could manage with a database table, ZeroMQ, or whatever else you like)
Respond with a "Request received" message.
That way you respond within seconds, which is acceptable for HTTP.
Then have a separate process monitor the queue and process the data on it (doing the 30-40 minute long job). When the job is complete, notify the user.
You could do this through the browser with a Notification or through a WebSocket or use a completely different mechanism (such as by sending an email to the user who made the request).
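The question is about IIS, but the pattern itself is stack-agnostic; purely as an illustration (Node.js/Express here, with a hypothetical enqueue() that writes to a database table, ZeroMQ, or whatever else you like), the request handler reduces to:

const express = require('express');
const app = express();
app.use(express.json());

app.post('/long-job', async (req, res) => {
  const jobId = await enqueue(req.body);        // store the work; do not do it in the request
  res.status(202).json({                        // respond within seconds
    message: 'Request received',
    jobId,                                      // lets the user be notified or query for it later
  });
});

app.listen(3000);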
I have a web service that accepts post requests. A post request specifies a specific job to be executed in the background, that modifies a database used for later analysis. The sender of the request does not care about the result, and only needs to receive a 202 acknowledgment from the web service.
How it was implemented so far:
The Flask web service gets the HTTP request, adds the necessary parameters to the task queue (RQ workers), and returns an acknowledgement. A separate RQ worker process listens on the queue and processes the job.
We have now switched to aiohttp, and realized that the web service can now schedule the actual job in its own event loop, by using asyncio.ensure_future().
This, however, blurs the lines between the web server and the task queue. On the positive side, it eliminates the need to manage the RQ workers.
Is this considered a good practice?
If your tasks are not CPU-heavy - yes, it is good practice.
If they are CPU-heavy, then you need to move them to a separate service or use run_in_executor(). Otherwise your aiohttp event loop will be blocked by these tasks and the server will not be able to accept new requests.
Yesterday I was giving a presentation on Node.js.
Someone asked me the following question:
As we know, Node.js is a single-threaded server; several requests come
to the server and it pushes all of them to the event loop. What if two
requests come to the server at exactly the same time, how will the
server handle this situation?
I took a guess and replied with the following:
I guess no two HTTP requests can come to the server at exactly the same
time; all requests come through a single socket so they will be in a queue. An HTTP
request has the following format:
the timestamp of the request is contained in its header and requests may be pushed to the event loop depending upon the timestamp in the header.
but I'm not sure whether I gave him the right answer.
I guess no two HTTP requests can come to the server at exactly the same time,
all requests come through a pipe so they will be in a queue.
This part is basically correct. Incoming connections go into the event queue and one of them has to be placed in the queue first.
What if two requests come to the server at exactly the same time, how
will the server handle this situation?
Since the server is listening for incoming TCP connections on a single socket in a single process, there cannot be two incoming connections at exactly the same time. One will be processed by the underlying operating system slightly before the other one. Think of it this way. An incoming connection is a set of packets over a network connection. One of the incoming connections will have its packets before the other one.
Even if you had multiple network cards and multiple network links so two incoming connections could literally arrive at the server at the exact same moment, the node.js queue will be guarded for concurrency by something like a mutex and one of the incoming connections will grab the mutex before the other and get put in the event queue before the other.
The one that is processed by the OS first will be put into the node.js event queue before the other one. When node.js is available to process the next item in the event queue, then whichever incoming request was first in the event queue will start processing first.
Because node.js JS execution is single threaded, the code processing that request will run its synchronous code to completion. If it has an async operation, then it will start that async operation and return. That will then allow the next item in the event queue to be processed and the code for the second request will start running. It will run its synchronous code to completion. Like with the first request, if it has an async operation, it will start that async operation and return.
At that point, after both requests have started their async operations and returned, it's just up to the event queue. When one of those async operations finishes, it will post another event to the event queue, and when the single thread of node.js is free, it will again process the next item in the event queue. If both requests have lots of async operations, their progress could interleave and both could be "in-flight" at the same time, as each fires an async operation and then returns until that async operation completes, at which point its processing picks up again when node.js is free to process that next event.
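A tiny self-contained illustration of that interleaving, with setTimeout standing in for an async database call:

const http = require('http');

let counter = 0;
http.createServer((req, res) => {
  const id = ++counter;                          // the order the OS handed us the connections
  console.log('request ' + id + ': synchronous part runs to completion');
  setTimeout(() => {                             // async operation starts; the handler returns
    console.log('request ' + id + ': async work finished, picking up again');
    res.end('done ' + id + '\n');
  }, 1000);
}).listen(3000);

// Hitting this with two near-simultaneous requests shows that one always enters the
// event queue first, yet both are "in-flight" during the async wait.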
the timestamp of the request is contained in its header and requests may be
pushed to the event loop depending upon the timestamp in the header.
This part is not really right. Incoming events of the same type are added to the queue as they arrive. The first ones to arrive go into the queue first - there isn't any step that examines some timestamp.
Multiple concurrent HTTP requests are not a problem. Node.js is asynchronous, and will handle multiple requests in the event-loop without having to wait for each one to finish.
For reference, example: https://stackoverflow.com/a/34857298/1157037
I have an application where users are running a geospatial query against a mongo database. The query can return many thousands of results (~50k). These results are then streamed to the client over a websocket. However, users can abort a request mid result set and execute a new query. Users will frequently start, abort, and re-start requests on the order of several times per minute. Sometimes they even cancel/restart every couple of seconds.
The question is, when a user aborts a request, how do I cancel the query on the server so it doesn't continue to tie up resources streaming back thousands of unneeded results? I'm currently calling destroy() on the cursor, but it's not clear that this is actually stopping the query from executing on the server.
What's the best practice in this case?
Have you tried this?
db.currentOp()               // list in-progress operations and find the opid of your query
db.killOp(IDRETURNEDHERE)    // kill that operation using the opid it returned
This is a good example.
The answer is that it depends upon a lot of your implementation details.
If your server is in the middle of streaming results (e.g. still hasn't sent or queued everything) when the server receives some sort of other message that the previous results should be cancelled, then it is possible for you to communicate with that other stream and tell it to stop sending. How exactly you would do that depends entirely upon your code and you would have to show us your code for us to know.
Chances are the db query is long since complete and what is going on is the server is in the process of streaming results to the client. So, if that's the case, then it isn't the db you're looking for, it's the code that streams the response to the client. Since node.js JS is single threaded, the only time another request would actually get run on the server would be while the streaming code was in some async write operation, waiting for that to finish. You would probably have to set some flag that was uniquely associated with a particular user and then your stream code would have to check for that flag before each chunk of data was sent. If it saw the cancel flag, it could abandon sending the rest of the results.
You could make things more cancellable by explicitly chunking your results (say 500 at a time) and checking for a cancel flag between the sending of each chunk.
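As a sketch (using the 'ws' package's send() callback; the cancelled flag is hypothetical state that your abort-message handler would set):

async function streamResults(socket, results, state) {
  const CHUNK = 500;
  for (let i = 0; i < results.length; i += CHUNK) {
    if (state.cancelled) return;                 // user aborted: stop sending the rest
    const chunk = results.slice(i, i + CHUNK);
    await new Promise((resolve, reject) =>
      socket.send(JSON.stringify(chunk), (err) => (err ? reject(err) : resolve()))
    );                                           // wait for the write before checking the flag again
  }
}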
If, on the other hand, all the data has already been buffered up by the TCP layer on the server, then the only way to stop that from being sent is to tear down the webSocket and force the client to reconnect.
I have an app that needs to run a very long process (takes 30-60 seconds for each request). After the processing, the result is then returned to the request as a response. This works fine locally, but it crashes my Heroku instance.
What I'd like to happen instead is:
User comes on site, request sent to backend
Backend returns immediately, and kicks off another process/task/job that does the processing
When the processing ends, the response is returned to the correct user.
I am not sure what all I need for this. Based on an hour of research, it seems like I can use Redis as a queue and a worker can poll it every x minutes. But what I can't understand is how to figure out which request to send the response to after processing ends.
Is there a sample Express/Node.js app for this? Any pointers are helpful.
Like you found in your research, setting up a worker queue using Redis is a good approach for long running processes. A nice library for this is kue (https://github.com/learnboost/kue).
When it comes to responding to a request with the results of the job, keeping an outstanding request hanging while waiting for a response is not a good way to go about it (and may not work; Heroku kills requests that have been idle for a certain period of time).
What you could do is start the background job when the request is made and respond to the request right away with the job ID. The client can then poll the server for the status of the job; when the job is complete, it can fetch the result.
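A bare-bones sketch of that flow, assuming Express (an in-memory jobs map stands in for Redis/kue state, and runLongJob() is a hypothetical promise-returning function that does the 30-60 second work):

const express = require('express');
const app = express();
app.use(express.json());

const jobs = new Map();                          // job id -> { status, result }

app.post('/jobs', (req, res) => {
  const id = String(Date.now());
  jobs.set(id, { status: 'pending', result: null });
  runLongJob(req.body)                           // kicked off, deliberately not awaited
    .then((result) => jobs.set(id, { status: 'done', result }))
    .catch(() => jobs.set(id, { status: 'failed', result: null }));
  res.status(202).json({ jobId: id });           // respond immediately with the job ID
});

app.get('/jobs/:id', (req, res) => {             // the client polls this until status is 'done'
  const job = jobs.get(req.params.id);
  if (!job) return res.sendStatus(404);
  res.json(job);
});

app.listen(3000);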
Kue (from #mattetre's answer) is not maintained anymore. Kue's GitHub page suggests Bull as a good alternative. It is a fast and reliable Redis-based queue for Node.js.