gearman and retrying workers with unreliable external dependencies - gearman

I'm using gearman to queue a variety of different jobs, some of which can always be serviced immediately, and some of which can "fail" because they require an unreliable external service. (For example, sending email might require an SMTP server that's frequently unavailable.)
If an external service goes down, I'd like to keep all jobs which require that service on the queue, and retry one job occasionally (every few minutes, say) until the service becomes available again. (Perhaps optionally sending email if the service has not been available for hours.)
However I'd like jobs that don't require a failed service to be passed on to workers as soon as possible. How can this be achieved? (I'm happy to put some of the logic in the workers if necessary, although it seems to be a bit "late" to throttle on the worker side.)

Gearman should already be able to handle this, as long as you have some workers that specialise in handling jobs with unreliable dependencies and don't handle other jobs, along with some workers that either do all jobs or just the jobs without unreliable dependencies.
All you would need to do is add some code to the unreliable-dependency workers so that they only accept jobs once they have checked that the dependent service is running. If the service is down, have them wait a bit and retest the service (and continue ad infinitum). Once the service is up, have them join the gearmand server, do a job, return the work, retest the service, and so on.
While the dependent service is down, the workers that don't handle jobs that need the service will keep on trundling through the job queue for the other jobs. Gearmand won't block an entire job queue (or worker) on one job type if there are workers available to handle other job types.
The key is to be sensible about how you define your job types and workers.
EDIT--
Ah-ha, I knew my thinking was a little out (I wrote my gearman system about a year ago and haven't really touched it since). My solution to this type of issue was to have all the workers that normally handle the dependent job unregister their dependent-job handling capability with the gearmand server once a failure was detected with the dependent service (and any workers that are currently trying to complete that job should return a failure). Once the service is back up, get those same workers to re-register their ability to handle that job. Do note that this requires another channel of communication for the workers to be notified of the status of the dependent services.
Hope this helps
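The unregister/re-register idea from the edit above can be sketched roughly as follows. This is only an illustration under assumptions: `worker.register` and `worker.unregister` are hypothetical hooks, not the API of any particular Gearman client library (the Gearman protocol itself exposes this as the CAN_DO and CANT_DO worker requests), and `checkService` stands in for whatever health check suits the dependent service.

```javascript
// Sketch: toggle a worker's ability to take a job type based on service health.
// `worker.register` / `worker.unregister` are HYPOTHETICAL hooks; wire them to
// whatever your Gearman client offers for adding/removing a function.
class DependencyGate {
  constructor(worker, jobName) {
    this.worker = worker;
    this.jobName = jobName;
    this.registered = false;
  }

  // Call with the latest health-check result; acts only on state changes.
  update(serviceIsUp) {
    if (serviceIsUp && !this.registered) {
      this.worker.register(this.jobName);
      this.registered = true;
    } else if (!serviceIsUp && this.registered) {
      this.worker.unregister(this.jobName);
      this.registered = false;
    }
    return this.registered;
  }

  // Poll `checkService` (async, resolves to true/false) every `intervalMs`.
  // A rejected check is treated as "service down".
  start(checkService, intervalMs) {
    const tick = async () => {
      let up = false;
      try {
        up = Boolean(await checkService());
      } catch (err) {
        up = false;
      }
      this.update(up);
    };
    tick();
    const timer = setInterval(tick, intervalMs);
    return () => clearInterval(timer); // call the returned function to stop polling
  }
}

// Demo with a fake worker that only records what it was asked to do.
const calls = [];
const fakeWorker = {
  register: (name) => calls.push(`register:${name}`),
  unregister: (name) => calls.push(`unregister:${name}`),
};
const gate = new DependencyGate(fakeWorker, 'send_email');
gate.update(true);  // service up: registers the job type
gate.update(true);  // no state change: nothing happens
gate.update(false); // service down: unregisters the job type
console.log(calls);
```

Workers that only handle jobs without the unreliable dependency never go through the gate, so they keep draining the rest of the queue while the gated workers wait.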

Related

Google Cloud Run - One container handling multiple similar requests with queue for each user

I have a service that gets a request from a webhook, and it is currently deployed across separate Cloud Run containers. These separate containers run the exact same image; however, each instance processes data separately for one particular account.
This is because processing a request takes ~3-5 min, and if the user sends in more requests, each one needs to wait for the existing process to complete for that particular user before the next one is processed, to avoid race conditions. The container can still receive webhooks, however; it is the actual processing of the data that needs to be done one by one for each account.
Is there a way to reduce the container count, for example by using one container to process all the requests, while still ensuring that it processes one task per user at a time and waits for that task to complete before processing the next request from the same user?
To explain it better, i.e.
Multiple tasks can be run across all the users
However, per user, only 1 task is processed at a time; once that is completed, the next task for that user can be processed
I was thinking of monitoring the tasks through a Redis cache; however, with Cloud Run being stateless, I am not sure that is the right way to go.
Or separating the requests from the actual work (master / worker) across 2 images, with the worker reporting back to the master once a task is completed for the user (using concurrency to process multiple tasks across the users); however, that might mean that I would have to increase the timeout for Cloud Run.
Good to hear any other suggestions.
Apologies if this doesn't seem clear, feel free to ask for more information.
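To make the constraint concrete, here is a minimal in-process sketch of "many users in parallel, one task at a time per user". It assumes a single Node.js process holds the queue in memory, so it says nothing about coordinating several Cloud Run instances; `KeyedSerialQueue` and `fakeTask` are illustrative names only.

```javascript
// Per-key serial executor: tasks sharing a key run one after another,
// tasks with different keys are free to overlap.
class KeyedSerialQueue {
  constructor() {
    this.tails = new Map(); // key -> promise that settles when the last queued task ends
  }

  // `task` is a function returning a promise. Resolves/rejects with its result.
  enqueue(key, task) {
    const previous = this.tails.get(key) || Promise.resolve();
    const run = previous.then(() => task());
    const tail = run.catch(() => {}); // a failed task must not block the next one
    this.tails.set(key, tail);
    tail.then(() => {
      // Drop the entry once the key is idle, so the map does not grow forever.
      if (this.tails.get(key) === tail) this.tails.delete(key);
    });
    return run;
  }
}

// Demo: two tasks for user "a" and one for user "b".
const demoQueue = new KeyedSerialQueue();
const fakeTask = (label, ms) => () =>
  new Promise((resolve) => {
    console.log('start', label);
    setTimeout(() => {
      console.log('end', label);
      resolve(label);
    }, ms);
  });

demoQueue.enqueue('a', fakeTask('a1', 30));
demoQueue.enqueue('a', fakeTask('a2', 5)); // waits for a1
demoQueue.enqueue('b', fakeTask('b1', 5)); // does not wait for a1
```

The webhook handler would call `enqueue(userId, ...)` and respond immediately, while the long-running processing proceeds in order for each user.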

How to access a worker's queued requests?

I'm implementing a web server using nodejs which must serve a lot of concurrent requests. As nodejs processes the requests one by one, it keeps them in an internal queue (in libuv, I guess).
I also want to run my web server using the cluster module, so there will be one request queue per worker.
Questions:
1. If any worker dies, how can I retrieve its queued requests?
2. How can I put retrieved requests into other workers' queues?
3. Is there any API to access live workers' request queues?
By No. 3 I want to keep queued requests somewhere such as Redis (if possible), so in case of server crash, failure or even hardware restart I can retrieve them.
Since you mentioned in the tags that you are already using (or want to use) Redis, you can use a Redis-based queue manager to do all the work for you.
Check out https://github.com/OptimalBits/bull (or its alternatives).
Bull has a concept of a queue: you add jobs to the queue and listen to the same queue from different processes/VMs. Bull will send each job to only one listener, and you can control how many jobs each listener processes at the same time (the concurrency level).
In addition, if one of the jobs fails to run (in other words, the listener of the queue threw an error), Bull can retry it, possibly on a different listener, provided the job was added with retry attempts configured.

Does Node.js need a job queue?

Say I have an Express service which sends email:
    app.post('/send', function (req, res) {
      sendEmailAsync(req.body).catch(console.error)
      res.send('ok')
    })
This works.
I'd like to know: what's the advantage of introducing a job queue here, like Kue?
Does Node.js need a job queue?
Not generically.
A job queue is to solve a specific problem, usually with more to do than a single node.js process can handle at once so you "queue" up things to do and may even dole them out to other processes to handle.
You may even have priorities for different types of jobs or want to control the rate at which jobs are executed (suppose you have a rate limit cap you have to remain below on some external server, or you just don't want to overwhelm some other server). One can also use nodejs clustering to increase the number of tasks that your node server can handle. So, a queue is about controlling the execution of some CPU- or resource-intensive task when you have more of it to do than your server can easily execute at once. A queue gives you control over the flow of execution.
I don't see any reason for the code you show to use a job queue unless you were doing a lot of these all at once.
The Kue library you mention (Bull, https://github.com/OptimalBits/bull, is a similar alternative) lists these features on its NPM page:
Delayed jobs
Distribution of parallel work load
Job event and progress pubsub
Job TTL
Optional retries with backoff
Graceful workers shutdown
Full-text search capabilities
RESTful JSON API
Rich integrated UI
Infinite scrolling
UI progress indication
Job specific logging
So, I think it goes without saying that you'd add a queue if you needed some specific queuing features and you'd use the Kue library if it had the best set of features for your particular problem.
In case it matters, your code is sending res.send("ok") before it finishes with the async tasks and before you know if it succeeded or not. Sometimes there are reasons for doing that, but sometimes you want to communicate back whether the operation was successful or not (which you are not doing).
Basically, the point of a queue would simply be to give you more control over how your tasks execute.
This could be for things like throttling how many you send, giving priority to other actions first, or evening out the flow (i.e., if 10000 get sent at the same time, you don't try to send all 10000 at once and kill your server).
What exactly you use your queue for, and whether it would be of any benefit, depends on your actual situation and use cases. At the end of the day, it's just about controlling the flow.
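As a rough illustration of "controlling the flow", the sketch below caps how many jobs run at once using a small in-memory queue. It is not how Kue or Bull work internally (those persist jobs in Redis); `LimitedQueue` and `fakeSend` are made-up names, with `fakeSend` standing in for something like `sendEmailAsync`.

```javascript
// Minimal in-memory job queue that caps how many jobs run at the same time.
class LimitedQueue {
  constructor(concurrency) {
    this.concurrency = concurrency;
    this.running = 0;
    this.pending = [];
  }

  // `job` is a function returning a promise. The returned promise settles
  // with the job's outcome once the job has actually been run.
  add(job) {
    return new Promise((resolve, reject) => {
      this.pending.push({ job, resolve, reject });
      this._next();
    });
  }

  // Start as many waiting jobs as the concurrency cap allows.
  _next() {
    while (this.running < this.concurrency && this.pending.length > 0) {
      const { job, resolve, reject } = this.pending.shift();
      this.running += 1;
      Promise.resolve()
        .then(job)
        .then(resolve, reject)
        .finally(() => {
          this.running -= 1;
          this._next();
        });
    }
  }
}

// Demo: five "emails", at most two in flight at any moment.
const emailQueue = new LimitedQueue(2);
const fakeSend = (n) => () =>
  new Promise((resolve) => setTimeout(() => resolve(`sent ${n}`), 10));

for (let i = 0; i < 5; i += 1) {
  emailQueue.add(fakeSend(i)).then((msg) => console.log(msg));
}
console.log('running:', emailQueue.running, 'waiting:', emailQueue.pending.length);
```

Because the queue lives in process memory, pending jobs are lost on a crash; persistence, retries and delayed jobs are what the dedicated libraries add on top of this basic idea.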

Distributing topics between worker instances with minimum overlap

I'm working on a Twitter project, using their streaming API, built on Heroku with Node.js.
I have a collection of topics that my app needs to process, which are pulled from MongoDB. I need to track each of these topics via the API, however it needs to be done such that each topic is tracked only once. As each worker process expires after approximately 1 hour, when a worker receives SIGTERM it needs to untrack each topic assigned, and release it back to the pool again.
I've been using RabbitMQ to communicate between app and worker processes, however with this I'm a little stuck. Are there any good examples, or advice you can offer on the correct way to do this?
Couldn't the worker just send a message via the message queue to the application when it receives a SIGTERM? According to the Heroku docs on shutdown, the process is allowed a few seconds (10) before it is forcefully killed.
So you can do something like this:
    // listen for SIGTERM sent by heroku
    process.on('SIGTERM', function () {
      // - notify app that this worker is shutting down
      messageQueue.sendSomeMessageAboutShuttingDown();
      // - shutdown process (might need to wait for async completion
      //   of message delivery to not prevent it from being delivered)
      process.exit();
    });
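The comment in the snippet above points out that the process may need to wait for the message to be delivered before exiting. A minimal sketch of that waiting step follows, assuming the publish call returns a promise. `installShutdownHandler` and `notify` are hypothetical names, and the deadline is chosen to stay under the 10 seconds mentioned above.

```javascript
// Sketch: on SIGTERM, wait for the shutdown notice to be delivered (bounded
// by a deadline) before exiting. `notify` is a HYPOTHETICAL async function
// standing in for the message-queue publish call.
function installShutdownHandler(proc, notify, deadlineMs) {
  let shuttingDown = false;
  proc.on('SIGTERM', async () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    const deadline = new Promise((resolve) => {
      const timer = setTimeout(resolve, deadlineMs);
      if (typeof timer.unref === 'function') timer.unref(); // do not keep the process alive
    });
    try {
      // Whichever finishes first: the delivery or the deadline.
      await Promise.race([notify(), deadline]);
    } catch (err) {
      console.error('shutdown notice failed:', err);
    } finally {
      proc.exit(0);
    }
  });
}

// Demo with a stand-in for `process`, so running this file does not exit.
const handlers = {};
const exitCodes = [];
const fakeProcess = {
  on: (signal, handler) => { handlers[signal] = handler; },
  exit: (code) => { exitCodes.push(code); },
};
let noticeSent = false;
installShutdownHandler(fakeProcess, async () => { noticeSent = true; }, 8000);
const shutdownDone = handlers.SIGTERM();
shutdownDone.then(() => console.log('exit codes:', exitCodes));

// In a real worker this would be wired up as:
// installShutdownHandler(process, () => messageQueue.sendSomeMessageAboutShuttingDown(), 8000);
```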
Alternatively, you could break up your work into much smaller chunks and have workers only 'take' work that will run for a couple of minutes or even seconds max. Your main application should be the bookkeeper: if a process doesn't complete its task within a specified time, assume it has gone missing and make the task available for another process to handle. You can probably also implement this behavior using confirms in RabbitMQ.
RabbitMQ won't do this for you.
It will allow you to distribute the work to another process and/or computer, but it won't provide the kind of mechanism you need to prevent more than one process / computer from working on a particular topic.
What you want is a semaphore - a way to control access to a particular "resource" from multiple processes... a way to ensure only one process is working on a particular resource at a given time. In your case the "resource" will be the topic... but it will still be the resource that you want to control access to.
FWIW, there has been discussion of using RabbitMQ to implement a distributed semaphore in the past:
https://www.rabbitmq.com/blog/2014/02/19/distributed-semaphores-with-rabbitmq/
https://aphyr.com/posts/315-call-me-maybe-rabbitmq
but the general consensus is that this is a bad idea. There are too many edge cases and scenarios in which RabbitMQ will fail to work as a proper semaphore.
There are some node.js semaphore libraries available. I would recommend looking at them, and using one of them. Have a single process manage the semaphore and decide which other process can / cannot work on which topic.
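A rough sketch of what "a single process manages who may work on which topic" could look like is shown below. It is not taken from any semaphore library: `TopicRegistry`, the lease length and the topic names are all assumptions for illustration, and in practice workers would reach it over some messaging channel rather than by direct method calls.

```javascript
// Sketch of a topic registry owned by ONE coordinating process. Each topic
// has at most one owner; leases expire so topics return to the pool if a
// worker vanishes without releasing them.
class TopicRegistry {
  constructor(topics, leaseMs, now = Date.now) {
    this.leaseMs = leaseMs;
    this.now = now; // injectable clock, handy for testing
    this.owners = new Map(topics.map((topic) => [topic, null]));
  }

  // Give `workerId` up to `max` topics that are unowned or whose lease expired.
  claim(workerId, max) {
    const granted = [];
    const t = this.now();
    for (const [topic, lease] of this.owners) {
      if (granted.length >= max) break;
      if (lease === null || lease.expires <= t) {
        this.owners.set(topic, { workerId, expires: t + this.leaseMs });
        granted.push(topic);
      }
    }
    return granted;
  }

  // Workers call this periodically while alive to keep their topics.
  renew(workerId) {
    const t = this.now();
    for (const lease of this.owners.values()) {
      if (lease !== null && lease.workerId === workerId) lease.expires = t + this.leaseMs;
    }
  }

  // Release everything a worker holds, e.g. when it receives SIGTERM.
  release(workerId) {
    const released = [];
    for (const [topic, lease] of this.owners) {
      if (lease !== null && lease.workerId === workerId) {
        this.owners.set(topic, null);
        released.push(topic);
      }
    }
    return released;
  }
}

// Demo with a fake clock.
let clock = 0;
const registry = new TopicRegistry(['nodejs', 'heroku', 'mongodb'], 1000, () => clock);
const w1Topics = registry.claim('w1', 2);
const w2Topics = registry.claim('w2', 2);
const freed = registry.release('w1');
console.log({ w1Topics, w2Topics, freed });
```

Since only the coordinating process mutates the map, no two workers can be handed the same topic at once; the lease expiry covers workers that are killed before they can call `release`.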

How can I prevent similar queues from running at the same time?

We currently process a set of tasks using queue workers in Laravel. When I run multiple instances of php artisan queue:work, jobs end up running at the same time (async). We are using Beanstalkd as the queue driver.
The issue is that in the queue worker we are polling an API that only allows one concurrent session for a particular agent_id. That is, only one API call with the same agent_id can run at a time.
We thought of spinning up multiple php artisan queue:work instances with a filter on the queue_name matching the agent_id, but we have over 500 agents, so we would need 500 of them, which is not ideal.
Is there any way to implement a lock-style feature for each agent_id, so that if a job is already running for a particular agent_id the new job is sent back to the queue? Or are there any features of Beanstalkd that would allow for this?
The other option could also be to gracefully handle the rejection from the API when the user is already logged in (and send the job back to the queue). But this could get messy and could clutter the logs.
You could either run only a single worker that is capable of running the fetch-from-API job, or use some sort of external marshalling/lock service.
The options for that may be either an internal rate-limiting system or some kind of common atomic locking system, for example a memcached or Redis server where each worker tries to set a lock key, and only the worker that successfully sets it gets to work on the task. An advantage of that may be that as soon as the API request has been completed, you can remove the lock, and then while the worker processes the results, a different worker can make a new request.
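The lock-key idea can be sketched as below, written in JavaScript for illustration; the same shape would carry over to a Laravel job. `InMemoryLockStore` is a stand-in, not a real client: with Redis the corresponding atomic operation is `SET key value NX PX ttl`, and with memcached it is `add`. The key prefix, `runExclusive` and its parameters are assumptions made for the example.

```javascript
// Stand-in for Redis/memcached: an atomic "set only if absent" with a TTL.
class InMemoryLockStore {
  constructor(now = Date.now) {
    this.now = now; // injectable clock, handy for testing
    this.locks = new Map();
  }

  // Returns true only if the key was absent (or expired) and is now set.
  setIfAbsent(key, value, ttlMs) {
    const current = this.locks.get(key);
    if (current && current.expires > this.now()) return false;
    this.locks.set(key, { value, expires: this.now() + ttlMs });
    return true;
  }

  // Delete only when the stored value matches, so a worker cannot remove a
  // lock that another worker acquired after the first one expired.
  deleteIfValue(key, value) {
    const current = this.locks.get(key);
    if (current && current.value === value) {
      this.locks.delete(key);
      return true;
    }
    return false;
  }
}

// Runs `callApi` only if the lock for this agent is free. When it is not,
// the caller should put the job back on the queue (ideally with a delay).
async function runExclusive(store, agentId, workerId, ttlMs, callApi) {
  const key = `agent-lock:${agentId}`;
  if (!store.setIfAbsent(key, workerId, ttlMs)) {
    return { ran: false };
  }
  try {
    return { ran: true, result: await callApi() };
  } finally {
    // Free the lock as soon as the API call ends, before result processing.
    store.deleteIfValue(key, workerId);
  }
}

// Demo: two workers race for the same agent; only one gets to call the API.
const store = new InMemoryLockStore();
const slowApi = () => new Promise((resolve) => setTimeout(() => resolve('data'), 10));
Promise.all([
  runExclusive(store, 42, 'worker-1', 60000, slowApi),
  runExclusive(store, 42, 'worker-2', 60000, slowApi),
]).then((outcomes) => console.log(outcomes));
```

The TTL matters: if a worker dies while holding the lock, the key expires on its own instead of blocking that agent_id forever.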
