AWS Step Functions activity task state timeouts limited functionality, conflicting documentation - state-machine

I am running into stuck executions on my Activity Task States even though they have timeouts set. It turns out, it seems this timeout setting is pretty pointless if your external Activity Worker is not even running, and you want to catch that scenario by timing out. This is because their documentation clearly states that the timeout setting only begins counting when an external Activity Worker calls the GetActivityTask API, which triggers and "ActivityStarted" execution event.
https://docs.aws.amazon.com/step-functions/latest/dg/troubleshooting-activities.html#troubleshooting-activities-stuck-state-machine
Moreover, they have documentation here above that states to workaround your executions gettin stuck in this way, use timeouts (which is exactly what I'm doing) 🤦
Am I missing something or is this an obvious gap in their functionality for Activity Tasks?

Related

Scaling an Azure Elastic Pool with .NET Fluent API

I'm using the Azure Fluent API, Azure Management Libraries for .NET, to scale the DTU's within an Azure Elastic Pool and would like to know if it's possible to trigger an update without having to wait for the processing to complete.
Currently the following block of code will wait until the Elastic Pool has finished scaling before it continues execution. With a large premium Elastic Pool this could mean that the this line will take up to 90 minutes to complete.
ElasticPool
.Update()
.WithDtu(1000)
.Apply();
There's also a ApplyAsync() method which i could deliberately not await to allow the program to continue execution, if i take this approach the program will end execution shortly after calling this line and i am unsure if this library has been designed to work in this fashion.
Does anyone know of a better solution to trigger an update without having to wait on a response? Or if it is safe to fire the async method without waiting for a response?
There is currently no way to make a fire and forget calls in the Fluent SDK for update scenarios but we are looking to the ways of enabling a manual status polling in the future. One option would be to create a thread that will wait on the completion. The other one is to use the Inner getter and make a low level BeginCreateOrUpdateAsync/BeginUpdateAsync method calls and then do manual polls.
On the side note if you need to make multiple calls and then wait for completion of all of them you can use Task.WaitAll(...) and provide the list of the ApplyAsync tasks.
Please log an issue in the repo if you will hit any errors because that way you will be able to track the progress of the fix.
edit: FYI the call is blocking not because SDK is waiting for the response from Azure but that SDK waits until the call is completed, operation of update is finished and the resource is ready to be used for further operations. Just firing an update and then trying to use resource will cause error responses if in your case Elastic Pool is still in the middle of the update.

Time triggered Azure Function and queue processing

I have a Function called once a day processing all messages in a queue. But I would like to also have the retries and poison messages logic as with the Queue Trigger. Is this somehow possible?
So at that point you function is purely a timer triggered function and from there you are no different than a console app in terms of how you would have to process messages from a queue with your own client connection, message loop, retrying and dead lettering (poison) logic. It's honestly just not the right tool for that job.
One approach I suppose you could consider if you wanted to be creative so that you could benefit from using an Azure Function's built in queue trigger behavior while still controlling what time the queue is processed is actually starting and stopping the function instance itself via something like Azure Scheduler. Scheduling the starting of the function is pretty straightforward and, once started, it will immediately begin draining the queue. The challenge is knowing how to stop it. The Azure Function runtime with the queue binding won't ever stop on its own as it's reading off the queue using a pull model so it's just gonna sit there waiting for new messages to arrive, right? So stopping it is really a question of understanding the need of the business. Do you stop once there's no more messages left for that day? Do you stop at a specific time of day? Etc. There's no correct answer here, it's totally domain specific, but whatever that answer is will dictate the exact approach taken.
Honestly, as I said earlier on, I'm not sure this is the right tool for the job. Yeah, it's nice that you get the retry and poison handling, but you're really going against the grain of the runtime. I would personally suggest you look into scheduling this simple console executable, executed in a task-like fashion using Azure Container Instances (some docs on that approach here). If you were counting on the auto-scale of Azure Functions, that's totally something you can get from Azure Container Instances as well.

Does Node.js need a job queue?

Say I have a express service which sends email:
app.post('/send', function(req, res) {
sendEmailAsync(req.body).catch(console.error)
res.send('ok')
})
this works.
I'd like to know what's the advantage of introducing a job queue here? like Kue.
Does Node.js need a job queue?
Not generically.
A job queue is to solve a specific problem, usually with more to do than a single node.js process can handle at once so you "queue" up things to do and may even dole them out to other processes to handle.
You may even have priorities for different types of jobs or want to control the rate at which jobs are executed (suppose you have a rate limit cap you have to remain below on some external server or just don't want to overwhelm some other server). One can also use nodejs clustering to increase the amount of tasks that your node server can handle. So, a queue is about controlling the execution of some CPU or resource intensive task when you have more of it to do than your server can easily execute at once. A queue gives you control over the flow of execution.
I don't see any reason for the code you show to use a job queue unless you were doing a lot of these all at once.
The specific https://github.com/OptimalBits/bull library or Kue library you mention lists these features on its NPM page:
Delayed jobs
Distribution of parallel work load
Job event and progress pubsub
Job TTL
Optional retries with backoff
Graceful workers shutdown
Full-text search capabilities
RESTful JSON API
Rich integrated UI
Infinite scrolling
UI progress indication
Job specific logging
So, I think it goes without saying that you'd add a queue if you needed some specific queuing features and you'd use the Kue library if it had the best set of features for your particular problem.
In case it matters, your code is sending res.send("ok") before it finishes with the async tasks and before you know if it succeeded or not. Sometimes there are reasons for doing that, but sometimes you want to communicate back whether the operation was successful or not (which you are not doing).
Basically, the point of a queue would simply be to give you more control over their execution.
This could be for things like throttling how many you send, giving priority to other actions first, evening out the flow (i.e., if 10000 get sent at the same time, you don't try to send all 10000 at the same time and kill your server).
What exactly you use your queue for, and whether it would be of any benefit, depends on your actual situation and use cases. At the end of the day, it's just about controlling the flow.

Error conditions and retries in gearman?

Can someone guide me on how gearman does retries when exceptions are
thrown or when errors occur?
I use the python gearman client in a Django app and my workers are
initiated as a Django command. I read from this blog post that retries
from error conditions are not straight forward and that it requires
sys.exit from the worker side.
Has this been fixed to retry perhaps with sendFail or sendException?
Also does gearman support retries with exponentials algorithm – say if
an SMTP failure happens its retries after 2,4,8,16 seconds etc?
To my understanding, Gearman employs a very "it's not my business" approach - e.g., it does not intervene with jobs performed, unless workers crash. Any success / failure messages are supposed to be handled by the client, not Gearman server itself.
In foreground jobs, this implies that all sendFail() / sendException() and other send*() are directed to the client and it's up to the client to decide whether to retry the job or not. This makes sense as sometimes you might not need to retry.
In background jobs, all the send*() functions lose their meaning, as there is no client that would be listening to the callbacks. As a result, the messages sent will be just ignored by Gearman. The only condition on which the job will be retried is when the worker crashes (which can by emulated with a exit(XX) command, where XX is a non-zero value). This, of course, is not something you want to do, because workers are usually supposed to be long-running processes, not the ones that have to be restarted after each unsuccessful job.
Personally, I have solved this problem by extending the default GearmanJob class, where I intercept the calls to send*() functions and then implementing the retry mechanism myself. Essentially, I pass all the retry-related data (max number of retries, times already retried) together with a workload and then handle everything myself. It is a bit cumbersome, but I understand why Gearman works this way - it just allows you to handle all the application logic.
Finally, regarding the ability to retry jobs with exponential timeout (or any timeout for that matter). Gearman has a feature to add delayed jobs (look for SUBMIT_JOB_EPOCH in the protocol documentation), yet I am not sure about its status - the PHP extension and, I think, the Python module do not support it and the docs say it can be removed in the future. But I understand it works at the moment - you just need to submit raw socket requests to Gearman to make it happen (and the exponential part should be implemented on your side, too).
However, this blog post argues that SUBMIT_JOB_EPOCH implementation does not scale well. He uses node.js and setTimeout() to make it work, I've seen others use the unix utility at to do the same. In any way - Gearman will not do it for you. It will focus on reliability, but will let you focus on all the logic.

Why Google App Engine Tasks can spuriously be executed more than once?

Why Google App Engine Tasks can be executed more than once?
According do Brett Slatkin talk from Google I/O 2009, it is possible for a task to spuriously run twice even without server failures!
This has something to do with spurious wakeup of threads?
Brant Slatkin gave a similar talk at I/0 2010.
I don't know that he ever gave details of how or when this could happen. His point was that because of the way Task Queues work it is possible by design for tasks to be reenqueued. Because of this you need to write your tasks so that it does not cause problems if that happens.
For example, let's say you have a task that sends an email and then increments a counter in Datastore. If there was a bug in your code OR if Datastore was down, it is possible for the email to be sent successfully but for the write to Datastore to fail. If you didn't handle the failure from Datastore in your code by handling the exception the failure to write to Datastore would result in your task returning a HTTP status code of 500. Task Queue is designed to reenqueue the task if it returns a status code >299. This would result in your task being executed over and over until the write to datastore was successful. Which means that someone would get many duplicate emails.
I think the line about "Possible for a task to spuriously run twice.." was just a way to say App Engine isn't guaranteed to protect against this so you need to make sure you take care of it in your code.

Resources