Error conditions and retries in Gearman?

Can someone guide me on how Gearman handles retries when exceptions are thrown or when errors occur?
I use the Python gearman client in a Django app, and my workers are started as a Django command. I read in this blog post that retrying after error conditions is not straightforward and that it requires calling sys.exit from the worker side.
Has this been fixed, so that a retry can be triggered with sendFail or sendException?
Also, does Gearman support retries with exponential backoff? For example, if an SMTP failure happens, does it retry after 2, 4, 8, 16 seconds, and so on?

To my understanding, Gearman takes a very "it's not my business" approach: it does not intervene in the jobs being performed unless workers crash. Any success or failure messages are supposed to be handled by the client, not by the Gearman server itself.
In foreground jobs, this implies that all sendFail() / sendException() and other send*() are directed to the client and it's up to the client to decide whether to retry the job or not. This makes sense as sometimes you might not need to retry.
In background jobs, all the send*() functions lose their meaning, as there is no client listening for the callbacks. As a result, the messages sent are simply ignored by Gearman. The only condition under which the job will be retried is a worker crash (which can be emulated with an exit(XX) call, where XX is a non-zero value). This, of course, is not something you want to do, because workers are usually supposed to be long-running processes, not ones that have to be restarted after each unsuccessful job.
Personally, I have solved this problem by extending the default GearmanJob class, where I intercept the calls to send*() functions and then implementing the retry mechanism myself. Essentially, I pass all the retry-related data (max number of retries, times already retried) together with a workload and then handle everything myself. It is a bit cumbersome, but I understand why Gearman works this way - it just allows you to handle all the application logic.
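The idea of passing the retry data along with the workload can be sketched roughly as follows. This is only an illustration, not the answerer's actual code: the envelope fields (payload, retries, maxRetries) and the helper names are assumptions, and resubmitting the returned workload to Gearman is left to whichever client library you use.

```javascript
// Hypothetical envelope: the retry bookkeeping travels with the workload,
// because Gearman itself keeps no retry state for a job.
function wrapWorkload(payload, maxRetries) {
  return JSON.stringify({ payload, retries: 0, maxRetries });
}

// Called by the worker when a job fails. Returns the workload to resubmit,
// or null once the retry budget is exhausted.
function nextAttempt(workload) {
  const job = JSON.parse(workload);
  if (job.retries >= job.maxRetries) {
    return null;
  }
  return JSON.stringify({ ...job, retries: job.retries + 1 });
}

const first = wrapWorkload({ to: 'user@example.com' }, 2);
const second = nextAttempt(first);   // retries: 1
const third = nextAttempt(second);   // retries: 2
const fourth = nextAttempt(third);   // null, budget used up
```

Because the counter lives inside the workload, any worker that picks up the resubmitted job sees how many attempts have already been made.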
Finally, regarding the ability to retry jobs with exponential timeout (or any timeout for that matter). Gearman has a feature to add delayed jobs (look for SUBMIT_JOB_EPOCH in the protocol documentation), yet I am not sure about its status - the PHP extension and, I think, the Python module do not support it and the docs say it can be removed in the future. But I understand it works at the moment - you just need to submit raw socket requests to Gearman to make it happen (and the exponential part should be implemented on your side, too).
However, this blog post argues that the SUBMIT_JOB_EPOCH implementation does not scale well. Its author uses Node.js and setTimeout() to make it work; I've seen others use the Unix utility at to do the same. Either way, Gearman will not do it for you. It focuses on reliability and leaves all the logic to you.
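A minimal sketch of the client-side exponential part, using the 2, 4, 8, 16 second schedule from the question. The function names, the cap of 300 seconds, and the submit callback are assumptions for illustration; submit stands in for whatever code resubmits the job to Gearman.

```javascript
// Delay in seconds before retry number `attempt` (1-based): 2, 4, 8, 16, ...
// capped at `maxSeconds` (an assumed safety limit).
function backoffSeconds(attempt, baseSeconds = 2, maxSeconds = 300) {
  return Math.min(baseSeconds * 2 ** (attempt - 1), maxSeconds);
}

// Hypothetical helper: `submit` stands in for whatever resubmits the job
// to Gearman from the client side.
function scheduleRetry(attempt, submit) {
  const delayMs = backoffSeconds(attempt) * 1000;
  return setTimeout(submit, delayMs);
}

const delays = [1, 2, 3, 4].map((n) => backoffSeconds(n));
console.log(delays); // [ 2, 4, 8, 16 ]
```

Note that a timer held in process memory is lost if that process restarts, so this approach only suits retries you can afford to drop.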

Related

How to send a message to ReactPHP/Amp/Swoole/etc. from PHP-FPM?

I'm thinking about making a worker script to handle async tasks on my server, using a framework such as ReactPHP, Amp or Swoole that would be running permanently as a service (I haven't made my choice between these frameworks yet, so solutions involving any of these are helpful).
My web endpoints would still be managed by Apache + PHP-FPM as normal, and I want them to be able to send messages to the permanently running script to make it aware that an async job is ready to be processed ASAP.
Pseudo-code from a web endpoint:
$pdo->exec('INSERT INTO Jobs VALUES (...)');
$jobId = $pdo->lastInsertId();
notify_new_job_to_worker($jobId); // how?
How do you typically handle communication from PHP-FPM to the permanently running script in any of these frameworks? Do you set up a TCP / Unix Socket server and implement your own messaging protocol, or are there ready-made solutions to tackle this problem?
Note: In case you're wondering, I'm not planning to use a third-party message queue software, as I want async jobs to be stored as part of the database transaction (either the whole transaction is successful, including committing the pending job, or the whole transaction is discarded). This is my guarantee that no jobs will be lost. If, worst case scenario, the message cannot be sent to the running service, missed jobs may still be retrieved from the database at a later time.
If your worker "runs permanently" as a service, it should provide some API to interact through. I use AmPHP in my project for async services, and my services implement HTTP/Websockets servers (using Amp libraries) as an API transport.
Hey, ReactPHP core team member here. It totally depends on what your ReactPHP/Amp/Swoole process does. Looking at your example, my suggestion would be to use a message broker/queue like RabbitMQ. That way the process can pick a message up when it's ready for it and ack it when it's done. If the process dies in the meantime, the message will be retried as long as it hasn't been acked. You can also build a small HTTP API, but that doesn't guarantee reprocessing of messages on fatal failures. Ultimately it all depends on your design: all three projects are toolsets for building your own architectures and systems, so it's up to you.
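For the "own messaging protocol" option raised in the question, the framing can be very small, since the job itself already lives in the database. The sketch below assumes a made-up protocol of one decimal job ID per line; it is written in JavaScript for illustration only, and the same buffering logic would apply to a ReactPHP, Amp or Swoole socket server.

```javascript
// Hypothetical framing: one decimal job ID per line, e.g. "42\n".
// TCP and Unix sockets deliver arbitrary chunks, so partial lines are buffered.
function createJobIdDecoder(onJobId) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    let newline;
    while ((newline = buffer.indexOf('\n')) !== -1) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (/^\d+$/.test(line)) {
        onJobId(Number(line)); // malformed lines are silently skipped
      }
    }
  };
}

const received = [];
const feed = createJobIdDecoder((id) => received.push(id));
feed('12\n3'); // one complete ID, one partial
feed('4\n');   // completes "34"
```

A lost notification is harmless under the design described in the question, because the worker can still find unprocessed rows in the Jobs table later.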

If Redis is single Threaded, how can it be so fast?

I'm currently trying to understand some basic implementation things of Redis. I know that redis is single-threaded and I have already stumbled upon the following Question: Redis is single-threaded, then how does it do concurrent I/O?
But I still think I haven't understood it correctly. As far as I know, Redis uses the reactor pattern with a single thread. If I understand this correctly, there is a watcher (which handles file descriptors and incoming/outgoing connections) that delegates the work to be done to its registered event handlers. They do the actual work and hand their responses back to the watcher as events, and the watcher transfers the responses to the clients. But what happens if a request (R1) from one client takes, let's say, about 1 minute, and another client issues another (fast) request (R2)? Since Redis is single-threaded, R2 cannot be delegated to the right handler until R1 is finished, right? In a multithreaded environment you could just start each handler in its own thread, so the "main" thread only accepts and responds to I/O connections and all other work is carried out in separate threads.
If it really just queues the I/O handling and handler logic, it could never be as fast as it is. What am I missing here?
You're not missing anything, besides perhaps the fact that most operations in Redis complete in less than a couple of microseconds. Long-running operations do indeed block the server during their execution.
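The blocking effect can be illustrated with a toy model of a single-threaded server. This is not how Redis is implemented; the costs below are made-up illustration numbers that mirror the R1/R2 example from the question.

```javascript
// Simulate a single-threaded server: commands run strictly one after another.
// `costMs` values are made-up illustration numbers, not Redis measurements.
function completionTimes(commands) {
  let clock = 0;
  return commands.map(({ name, costMs }) => {
    clock += costMs; // nothing else can run while this command executes
    return { name, finishedAtMs: clock };
  });
}

const timeline = completionTimes([
  { name: 'R1 (slow)', costMs: 60000 },
  { name: 'R2 (fast)', costMs: 1 },
]);
console.log(timeline);
```

In this model R2 finishes at 60001 ms even though it needs only 1 ms of work, which is why a one-minute command is a problem and microsecond-scale commands are not.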
Suppose 10,000 users are pulling live data with hmget, each call taking 10 seconds, while on the other side the server is broadcasting with hmset: Redis can only run the set once the requests queued ahead of it have finished.
Redis is good for queuing and limited processing, such as lazily inserting last-login info, but not for broadcasting live info; in that case, memcached is the right choice. Redis is single-threaded, so it behaves like a FIFO queue.

Does Node.js need a job queue?

Say I have an Express service which sends email:
app.post('/send', function(req, res) {
  sendEmailAsync(req.body).catch(console.error)
  res.send('ok')
})
This works.
I'd like to know what the advantage of introducing a job queue, such as Kue, would be here.
Does Node.js need a job queue?
Not generically.
A job queue is to solve a specific problem, usually with more to do than a single node.js process can handle at once so you "queue" up things to do and may even dole them out to other processes to handle.
You may even have priorities for different types of jobs or want to control the rate at which jobs are executed (suppose you have a rate limit cap you have to remain below on some external server or just don't want to overwhelm some other server). One can also use Node.js clustering to increase the number of tasks that your Node server can handle. So, a queue is about controlling the execution of some CPU- or resource-intensive task when you have more of it to do than your server can easily execute at once. A queue gives you control over the flow of execution.
I don't see any reason for the code you show to use a job queue unless you were doing a lot of these all at once.
The Kue library you mention (see also https://github.com/OptimalBits/bull) lists these features on its NPM page:
Delayed jobs
Distribution of parallel work load
Job event and progress pubsub
Job TTL
Optional retries with backoff
Graceful workers shutdown
Full-text search capabilities
RESTful JSON API
Rich integrated UI
Infinite scrolling
UI progress indication
Job specific logging
So, I think it goes without saying that you'd add a queue if you needed some specific queuing features and you'd use the Kue library if it had the best set of features for your particular problem.
In case it matters, your code is sending res.send("ok") before it finishes with the async tasks and before you know if it succeeded or not. Sometimes there are reasons for doing that, but sometimes you want to communicate back whether the operation was successful or not (which you are not doing).
Basically, the point of a queue would simply be to give you more control over their execution.
This could be for things like throttling how many you send, giving priority to other actions first, evening out the flow (i.e., if 10000 get sent at the same time, you don't try to send all 10000 at the same time and kill your server).
What exactly you use your queue for, and whether it would be of any benefit, depends on your actual situation and use cases. At the end of the day, it's just about controlling the flow.
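To make "controlling the flow" concrete, here is a minimal in-process sketch that caps how many async tasks run at once. It is not Kue or Bull and has no persistence, retries or priorities; the TaskQueue class and the fakeSend stand-in for sendEmailAsync are assumptions for illustration.

```javascript
// Minimal in-process queue: at most `concurrency` tasks run at once,
// the rest wait in FIFO order. No persistence, so jobs die with the process.
class TaskQueue {
  constructor(concurrency) {
    this.concurrency = concurrency;
    this.running = 0;
    this.pending = [];
  }

  // `task` is a function returning a promise; the returned promise
  // settles with that task's outcome.
  push(task) {
    return new Promise((resolve, reject) => {
      this.pending.push({ task, resolve, reject });
      this.drain();
    });
  }

  drain() {
    while (this.running < this.concurrency && this.pending.length > 0) {
      const { task, resolve, reject } = this.pending.shift();
      this.running += 1;
      Promise.resolve()
        .then(task)
        .then(resolve, reject)
        .finally(() => {
          this.running -= 1;
          this.drain(); // start the next waiting task, if any
        });
    }
  }
}

// Hypothetical stand-in for sendEmailAsync from the question.
const fakeSend = (n) => new Promise((done) => setTimeout(() => done(n), 10));

const queue = new TaskQueue(2);
const results = Promise.all(
  [1, 2, 3, 4, 5].map((n) => queue.push(() => fakeSend(n)))
);
```

With a limit of 2, five pushes leave two tasks running and three waiting; a burst of 10000 sends would be evened out in the same way.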

Distributing topics between worker instances with minimum overlap

I'm working on a Twitter project, using their streaming API, built on Heroku with Node.js.
I have a collection of topics that my app needs to process, which are pulled from MongoDB. I need to track each of these topics via the API, however it needs to be done such that each topic is tracked only once. As each worker process expires after approximately 1 hour, when a worker receives SIGTERM it needs to untrack each topic assigned, and release it back to the pool again.
I've been using RabbitMQ to communicate between app and worker processes, however with this I'm a little stuck. Are there any good examples, or advice you can offer on the correct way to do this?
Couldn't the worker just send a message via the message queue to the application when it receives a SIGTERM? According to the Heroku docs on shutdown, the process is allowed a couple of seconds (10) before it is forcefully killed.
So you can do something like this:
// listen for SIGTERM sent by heroku
process.on('SIGTERM', function () {
  // - notify app that this worker is shutting down
  messageQueue.sendSomeMessageAboutShuttingDown();
  // - shutdown process (might need to wait for async completion
  //   of message delivery to not prevent it from being delivered)
  process.exit()
});
Alternatively you could break up your work in much smaller chunks and have workers only 'take' work that will run for a couple of minutes or even seconds max. Your main application should be the bookkeeper and if a process doesn't complete its task within a specified time assume it has gone missing and make the task available for another process to handle. You can probably also implement this behavior using confirms in rabbitmq.
RabbitMQ won't do this for you.
It will allow you to distribute the work to another process and/or computer, but it won't provide the kind of mechanism you need to prevent more than one process / computer from working on a particular topic.
What you want is a semaphore - a way to control access to a particular "resource" from multiple processes... a way to ensure only one process is working on a particular resource at a given time. In your case the "resource" will be the topic... but it will still be the resource that you want to control access to.
FWIW, there has been discussion of using RabbitMQ to implement a distributed semaphore in the past:
https://www.rabbitmq.com/blog/2014/02/19/distributed-semaphores-with-rabbitmq/
https://aphyr.com/posts/315-call-me-maybe-rabbitmq
but the general consensus is that this is a bad idea. There are too many edge cases and scenarios in which RabbitMQ will fail to work as a proper semaphore.
There are some node.js semaphore libraries available. I would recommend looking at them, and using one of them. Have a single process manage the semaphore and decide which other process can / cannot work on which topic.
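A rough sketch of the single bookkeeper idea, combined with the timeout suggestion from the other answer: one process owns the topic pool and leases each topic to at most one worker. The TopicPool class, its method names and the lease length are assumptions for illustration; how workers reach the bookkeeper (for example over RabbitMQ messages) is left out.

```javascript
// Single-process bookkeeper: each topic is leased to at most one worker.
// A lease older than `leaseMs` is treated as abandoned and can be reclaimed.
class TopicPool {
  constructor(topics, leaseMs) {
    this.leaseMs = leaseMs;
    this.leases = new Map(topics.map((topic) => [topic, null]));
  }

  // Returns a free (or expired) topic for `workerId`, or null if none is left.
  claim(workerId, now = Date.now()) {
    for (const [topic, lease] of this.leases) {
      if (lease === null || lease.expiresAt <= now) {
        this.leases.set(topic, { workerId, expiresAt: now + this.leaseMs });
        return topic;
      }
    }
    return null;
  }

  // Called when a worker reports SIGTERM, to hand its topics back early.
  releaseAll(workerId) {
    for (const [topic, lease] of this.leases) {
      if (lease !== null && lease.workerId === workerId) {
        this.leases.set(topic, null);
      }
    }
  }
}

const pool = new TopicPool(['nodejs', 'heroku'], 60000);
const a = pool.claim('worker-1', 0);     // 'nodejs'
const b = pool.claim('worker-2', 0);     // 'heroku'
const c = pool.claim('worker-3', 0);     // null, everything is leased
pool.releaseAll('worker-1');
const d = pool.claim('worker-3', 1000);  // 'nodejs' again
const e = pool.claim('worker-4', 60500); // 'heroku', worker-2's lease expired
```

This only guarantees exclusivity while a single process holds the pool; it is not a distributed semaphore.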

How to handle requests that have heavy load?

This is a request for advice on which scenario is the smarter approach to heavy lifting on the server end while keeping the UI responsive for the user.
The setup;
My system consists of two services (written in Node): a Frontend Service that listens for requests from the user, and a Background Worker that does heavy lifting and won't be finished within 1-2 seconds (e.g. video conversion, image resizing, gzipping, spidering etc.). The user is connected to the Frontend Service via WebSockets (and normal POST requests).
Scenario 1;
When a user e.g. uploads a video, the Frontend Service only does some simple checks, creates a job in the name of the user for the Background Worker to process, and directly responds with status 200. Later on the Worker sees it has work, does the work and finishes the job. It then finds the socket the user is connected to (if any) and sends a "hey, job finished" with the data related to the video conversion job (url, length, bitrate, etc.).
Pros I see: Quick user feedback of a successful upload (e.g. the progress bar can be hidden)
Cons I see: The user gets a fake "success" response with no data to handle/display and needs to wait until the job finishes anyway.
Scenario 2;
Like Scenario 1, except that the Frontend Service doesn't respond with a status 200 but rather subscribes to the created job's "onComplete" event and lets the request dangle until the callback is fired and the data can be sent down the pipe to the user.
Pros I see: "onSuccess", all data is at the user
Cons I see: Depending on the job's weight and the active job count, the user's request could time out
While writing this question things are getting clearer to me by the minute (Scenario 1, but with smart success and update events sent). Regardless, I'd like to hear about other Scenarios you use or further Pros/Cons towards my Scenarios!?
Thanks for helping me out!
Some unnecessary info: for WebSockets I'm using socket.io, for job creation Kue, and for pub/sub Redis.
I just wrote something like this and I use both approaches for different things. Scenario 1 makes most sense IMO because it matches the reality best, which can then be conveyed most accurately to the user. By first responding with a 200 "Yes I got the request and created the 'job' like you requested", you can accurately update the UI to reflect that the request is being dealt with. You can then use the push channel to notify the user of updates such as progress percentage, error, and success as needed, but without the UI 'hanging' (obviously you wouldn't hang the UI in Scenario 2, but it's an awkward situation where things are happening and the UI just has to 'guess' that the job is being processed).
Scenario 1 -- but instead of responding with 200 OK, you should respond with 202 Accepted. From Wikipedia:
https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
202 Accepted: The request has been accepted for processing, but the processing has not been completed. The request might or might not eventually be acted upon, as it might be disallowed when processing actually takes place.
This leaves the door open for the possibility of worker errors. You are just saying that you accepted the request and are trying to do something with it.
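Scenario 1 with a 202 response could look roughly like the framework-free sketch below. The in-memory jobs and sockets maps and the handleUpload and finishJob names are assumptions for illustration; in the setup from the question, Kue would hold the jobs and socket.io would deliver the push message.

```javascript
// Framework-free sketch; `jobs` and `sockets` are hypothetical in-memory stores.
const jobs = new Map();    // jobId -> { userId, file, status, result }
const sockets = new Map(); // userId -> { send(message) }
let nextJobId = 1;

// Frontend: create the job and answer immediately with 202 Accepted.
function handleUpload(userId, file) {
  const jobId = nextJobId++;
  jobs.set(jobId, { userId, file, status: 'queued', result: null });
  return { status: 202, body: { jobId, state: 'queued' } };
}

// Worker side: record the outcome and push it if the user is still connected.
function finishJob(jobId, error, result) {
  const job = jobs.get(jobId);
  job.status = error ? 'failed' : 'done';
  job.result = error ? { message: error.message } : result;
  const socket = sockets.get(job.userId);
  if (socket) {
    socket.send({ jobId, status: job.status, result: job.result });
  }
}

const pushed = [];
sockets.set('alice', { send: (message) => pushed.push(message) });
const response = handleUpload('alice', 'clip.mov');
finishJob(response.body.jobId, null, { url: '/videos/1.mp4' });
```

Returning the job ID in the 202 body lets the client match later push messages, including failures, to the upload that caused them.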

Resources