Update (skip) wait duration on AWS Step Function - node.js

I have a Step Function set up that has a 'wait' state (eg, 999999 seconds). Once the wait is over, the Step Function invokes a Lambda. Sometimes, I will want to interrupt the wait time and trigger the Lambda immediately. Is this possible?
I thought I could do it by using the aws-sdk with the Step Functions API to manually skip the wait; but I've been experimenting with no success.
I tried the API's StartExecution method, but it only starts the entire Step Function (https://docs.aws.amazon.com/step-functions/latest/apireference/API_StartExecution.html). I can't find anything for manipulating individual steps.
I can use GetExecutionHistory to return an event object that describes the Wait step, eg:
{
  timestamp: 2022-10-17T08:38:27.849Z,
  type: 'WaitStateEntered',
  id: 2,
  previousEventId: 0,
  stateEnteredEventDetails: {
    name: 'Wait',
    input: '{\n "Comment": "Insert your JSON here"\n}',
    inputDetails: { truncated: false }
  }
},
But there doesn't seem to be a way to manipulate this event to move to the next step.

I've spoken to AWS tech support, who confirmed that there is nothing in the aws-sdk or the aws-cdk that provides for updating an existing state (eg, a 'wait' state) while it is running. There are some workarounds:
1. AWS tech support suggest iterating a loop using a Lambda. This basically loops over Choice > Wait > Lambda > (repeat), where the Lambda returns an output that tells the Choice state whether to continue the loop or direct the execution to another state. The advantage is that we don't need to cancel the execution and we maintain a simpler record of activities. The disadvantage is that we are regularly invoking a Lambda.
2. As per @Guy's suggestion, we could split the Step Function into two separate Step Functions. This means we could cancel the initial Step Function and then trigger the latter Step Function manually. We can cancel the execution of a Step Function with stopExecution. For example, using the aws-sdk:
import { config, StepFunctions } from "aws-sdk"; // package.json: "aws-sdk": "^2.1232.0"

config.update({ region: "eu-west-2" });
const stepFunctions = new StepFunctions();
const stoppedExecution = await stepFunctions
  .stopExecution({
    executionArn: "...",
    cause: "...",
    error: "...",
  })
  .promise();
We can then trigger a new Step Function with startExecution.
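A minimal sketch of that call (the stateMachineArn and input here are placeholders):

// Kick off the replacement Step Function once the original has been stopped
const startedExecution = await stepFunctions
  .startExecution({
    stateMachineArn: "...",
    input: JSON.stringify({ Comment: "Insert your JSON here" }),
  })
  .promise();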
3. Step Functions also allow us to wait for a callback with a task token. Basically, the task state sends a task token (eg, to a Lambda), then pauses until the token is returned to the execution. Once it is received, the execution proceeds to the next step.
There are two ways I can think of proceeding from item 3:
a. Configure a heartbeat timeout for the waiting task. If the heartbeat timeout elapses without the token being received, the task fails with a States.Timeout error name. We can (I assume) catch the error on the task to trigger the next step anyway. So the default behaviour becomes triggering the next step once the duration elapses, and we also have the facility to skip the wait by sending the task token back to the execution.
b. Use another service to perform the wait and return the task token after the wait duration has elapsed.
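For item 3a, skipping the wait then just means returning the token early. A minimal sketch with the aws-sdk (where the task token has been retrieved from wherever the task state delivered it):

// Returning the task token completes the waiting task and the execution moves on
await stepFunctions
  .sendTaskSuccess({
    taskToken: "...",
    output: JSON.stringify({ skippedWait: true }),
  })
  .promise();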

Option 3 of your answer would still require some service/process to handle/poll whether or not to continue, I believe.
I've implemented a pattern very similar to your description, but one that can be defined in a single Step Function definition.
Note: I consider this a hack/abuse of the States Language, but it has the benefit of keeping a single state machine definition/execution, and it avoids paying for excessive state transitions in the looping method:
1. Put your Wait state inside a new Parallel state.
2. Add a waitForTaskToken-type task (DynamoDB, SQS, etc) in the Parallel state, making sure to configure its timeout/heartbeat to the same duration as the "neighboring" Wait state.
3. If/when you want to "short circuit" the wait duration, query wherever you stored the task token and send a SendTaskFailure with a unique cause/error payload.
4. Configure the Catch (fallback state) for the Parallel state to point to your "Invoke Lambda" state.
5. Also configure the Next field for the Parallel block to point to your "Invoke Lambda" state.
This may not be very intuitive, but it relies on the fact that if any state defined in a Parallel state fails, the entire block fails immediately. With some custom error handling, though, you can "ignore" the special sentinel error, short-circuiting the long wait duration and proceeding to your next state; see the sketch below.
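A minimal sketch of that definition (the state names, the SQS integration and the "SkipWait" sentinel error name are illustrative):

{
  "StartAt": "ParallelWait",
  "States": {
    "ParallelWait": {
      "Type": "Parallel",
      "Next": "Invoke Lambda",
      "Catch": [
        { "ErrorEquals": ["SkipWait"], "Next": "Invoke Lambda" },
        { "ErrorEquals": ["States.Timeout"], "Next": "Invoke Lambda" }
      ],
      "Branches": [
        {
          "StartAt": "Wait",
          "States": {
            "Wait": { "Type": "Wait", "Seconds": 999999, "End": true }
          }
        },
        {
          "StartAt": "StoreTaskToken",
          "States": {
            "StoreTaskToken": {
              "Type": "Task",
              "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
              "TimeoutSeconds": 999999,
              "Parameters": {
                "QueueUrl": "...",
                "MessageBody": { "TaskToken.$": "$$.Task.Token" }
              },
              "End": true
            }
          }
        }
      ]
    },
    "Invoke Lambda": { "Type": "Task", "Resource": "...", "End": true }
  }
}

Either way you land on "Invoke Lambda": via the Catch when a SendTaskFailure with error "SkipWait" short-circuits the block, via the Catch on States.Timeout when the token task's timeout expires alongside the Wait, or via the Parallel's own Next.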
It's definitely not perfect, and you'll have to play around with the errors/timeouts/heartbeats that make sense for your use case.
Depending on how many executions/transitions you expect, the easiest thing I've found is just making sure the task token ends up in a predictable CloudWatch log group, then querying it with CloudWatch Logs Insights when I need to retrieve it again.

Related

Bull queue - ability to mark job as "failed" (depending on result) and then optionally retry manually

I have a queue system that uses Bull (https://optimalbits.github.io/bull/), which receives API requests and then dispatches them to an ERP system consecutively, once each request completes (e.g. to avoid crashing the ERP system when a user generates 20 API requests at once).
However, sometimes the ERP system fails to process the API commands for a variety of reasons. Currently this is treated as "completed" in the Bull queue, with a custom status of "failed".
However, we'd like to be able to be notified of such failures, and be able to manually retry the failed API command.
In the documentation (https://github.com/OptimalBits/bull/blob/master/REFERENCE.md#queueprocess) it seems to indicate the done callback can be called with an error instance - but doesn't give an example...
The callback is called everytime a job is placed in the queue. It is
passed an instance of the job as first argument.
If the callback signature contains the second optional done argument,
the callback will be passed a done callback to be called after the job
has been completed. The done callback can be called with an Error
instance, to signal that the job did not complete successfully, or
with a result as second argument (e.g.: done(null, result);) when the
job is successful. Errors will be passed as a second argument to the
"failed" event; results, as a second argument to the "completed"
event.
If, however, the callback signature does not contain the done
argument, a promise must be returned to signal job completion. If the
promise is rejected, the error will be passed as a second argument to
the "failed" event. If it is resolved, its value will be the
"completed" event's second argument.
[emphasis added]
Here's a sample of the code I'm using where the queue is being processed:
queue.process(async (job, done) => {
  switch (job.data.type) {
    case 'API_Type':
      const workOrder = await createWorkOrder(job.data)
      socketService.emiter('API_Type', workOrder, job.data.socketId)
      workOrder.status === 'success' ? done(null, workOrder) : placeJobInFailedStatus
      break
    //...
    default:
      done()
  }
})
Where it says in pseudo code "placeJobInFailedStatus" - how can I instead make it "fail"/"stall"/"pause" just that job in the Bull queue, while continuing with others? (instead of marking it completed)
And is there any way to manually retry a "failed" queue entry, or does it need to be re-added fresh to the queue? (I don't want to just try it again in a few seconds - it may need some manual user input to adjust something first.) I'm not seeing a way to manually retry in the documentation. (There is https://github.com/OptimalBits/bull/blob/master/REFERENCE.md#jobretry, but I'm not following as to how it's invoked. And there is https://github.com/OptimalBits/bull/blob/master/REFERENCE.md#jobpromote, but I'm also not seeing how to put a task into a "delayed" status.)
Before I try designing a separate method to handle such, I'm wondering if some of it is supported natively, as it seems like it should be a common need for a queue system.
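For what it's worth, from the quoted reference I'd guess the shape I'm after is something like this (untested sketch; the docs say calling done with an Error marks the job failed, and job.retry() re-runs a job that has failed):

// In the processor: signal failure instead of completing with a custom status
workOrder.status === 'success'
  ? done(null, workOrder)
  : done(new Error('ERP system rejected the API command'))

// Later, after manual correction, retry the failed job by its id
const job = await queue.getJob(jobId)
await job.retry()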

How do I determine, at the earliest, if an AWS step function state has failed?

We are implementing a workflow on AWS Step Functions (a state machine) that deals with updating user records, with a possible rollback in case something goes wrong. The state machine does its processing in two parts:
Part 1 - updating
Part 2 - rollback
When the rollback path is taken by the state machine, the process takes very long. It's unacceptable to make the client wait this long; just before starting the rollback, however, the client could be informed. I am trying to figure out a way to achieve this.
I have already tried using describeExecution(), but the status changes to FAILED only after the state machine has finished executing, which is again very late.
I tried inserting an "SQS send message" step at the point (between part 1 and part 2) where it is likely to fail, and then polling this queue from the orchestration function (the handler of my API endpoint). However, this is not going to work, as I may have 100s of requests running in parallel and SQS will eventually fail.
Appreciate an early response.
Cheers.
First, I'd recommend reading up on error handling in Step Functions: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
You could use fallback states (Task, Map, Parallel) and catch the error by adding a Catch field, like so:
"Catch": [ {
"ErrorEquals": [ "java.lang.Exception" ],
"ResultPath": "$.error-info",
"Next": "RecoveryState"
}, {
"ErrorEquals": [ "States.ALL" ],
"Next": "EndState"
} ]
If you intend to use the API to get the current state of the execution, you could use GetExecutionHistory. It returns a list of events, and you can check the returned array for failures, i.e. taskFailedEventDetails.
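A rough sketch of that check with the aws-sdk (the execution ARN is a placeholder):

const { StepFunctions } = require("aws-sdk");
const stepFunctions = new StepFunctions();

// Look for the most recent task failure in the execution's event history
const history = await stepFunctions
  .getExecutionHistory({ executionArn: "...", reverseOrder: true })
  .promise();

const failed = history.events.find((e) => e.type === "TaskFailed");
if (failed) {
  console.log(failed.taskFailedEventDetails); // { resourceType, resource, error, cause }
}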

Should an AWS Lambda function instance in Node.js pick up another request during an async await?

Let's say I've got a queue of requests for my Lambda, and inside the Lambda there might be an external service call that takes 500ms, which is wrapped in async/await like:
async callSlowService(serializedObject: string): Promise<void> {
  await slowServiceClient.post(serializedObject);
}
Should I expect that my Lambda instance will pick up another request off the queue while awaiting the slow call? I know it'll also spin up new Lambda instances, but that's not what I'm talking about; I mean interleaving requests on a single instance.
I'm asking because I would think that it should do this; however, I'm testing with a sleep function and a load generator, and it's not happening. My code actually looks like this:
async someCoreFunction(): Promise<void> {
  // Business logic
  console.log("Before wait");
  await sleep(2000);
  console.log("After wait");
}

const sleep = (milliseconds) => {
  return new Promise(resolve => setTimeout(resolve, milliseconds))
};
And while it definitely is taking 2 seconds between the "Before wait" and "After wait" statements, there are no new logs being written in that time.
No.
Lambda as a service is largely unaware of what your code is doing. It simply takes a request, invokes your code and then waits for it to return.
I would not expect AWS to implement a feature like interleaving any time soon. It would require the lambda runtime to have substantial knowledge of how your code behaves (for example, you may be awaiting two concurrent long asynchronous calls within one invocation- so simply interrupting when you hit your first await would be incorrect). It would also cause no end of issues for people using the shared scope outside of the handler for common setup/teardown.
As you pay per invocation and time, I don't really see much difference between interleaving and processing the queue in parallel (which Lambda natively supports), considering that time spent awaiting still requires some compute. If interleaving ever happens, I'd expect it to be a way for AWS to reduce the drain on their own resources.
n.b. If you are awaiting for a long time in a Lambda function, then there is probably a better way of doing things. For example, Step Functions provide a great way to kick off and poll long-running tasks. Similarly, the pattern of using a session variable in your payload is a good way of allowing a long service to call back into Lambda without the Lambda idling.

Thread with a forever loop with one inherently async operation

I'm trying to understand the semantics of async/await in an infinitely looping worker thread started inside a Windows service. I'm a newbie at this, so give me some leeway here; I'm trying to understand the concept.
The worker thread will loop forever (until the service is stopped) and it processes an external queue resource (in this case a SQL Server Service Broker queue).
The worker thread uses config data which could be changed while the service is running, by receiving commands on the main service thread via some kind of IPC. Ideally the worker thread should process those config changes while waiting for the external queue messages to be received. Reading from Service Broker is inherently asynchronous: you literally issue a "WAITFOR RECEIVE" T-SQL statement with a receive timeout.
But I don't quite understand the flow of control I'd need to use to do that.
Let's say I used a ConcurrentQueue to pass config-change messages from the main thread to the worker thread. Then, if I did something like...
void ProcessBrokerMessages()
{
    foreach (BrokerMessage m in ReadBrokerQueue())
    {
        ProcessMessage(m);
    }
}

// ... inside the worker thread:
while (!serviceStopped)
{
    foreach (var configChange in configChangeConcurrentQueue)
    {
        ProcessConfigChange(configChange);
    }
    ProcessBrokerMessages();
}
...then the foreach loop that processes config changes and the broker-processing function have to "take turns" running. Specifically, the config-change-processing loop won't run while the potentially long-running broker receive command is running.
My understanding is that simply turning ProcessBrokerMessages() into an async method doesn't help me in this case (or I don't understand what would happen). To me, with my lack of understanding, the most intuitive interpretation seems to be that when I hit the async call it would go off and do its thing, and execution would continue with a restart of the outer while loop; but that would mean the loop would also execute the ProcessBrokerMessages() function over and over even though it's already running from the invocation in the previous loop, which I don't want.
As far as I know this is not what would happen, though I only "know" that because I've read something along those lines. I don't really understand it.
Arguably the existing flow of control (ie, without the async call) is OK: if config changes affect the ProcessBrokerMessages() function (which they can), then the config can't be changed while the function is running anyway. But that seems like a point specific to this particular example. I can imagine a case where config changes affect something else the thread does, unrelated to the ProcessBrokerMessages() call.
Can someone improve my understanding here? What's the right way to have:
- a block of code which loops over multiple statements
- where one (or some) but not all of those statements are asynchronous
- and the async operation should only ever be executing once at a time
- but execution should keep looping through the rest of the statements while the single instance of the async operation runs
- and the async method should be called again in the loop if the previous invocation has completed?
It seems like I could use a BackgroundWorker to run the receive statement, which flips a flag when its job is done, but it also seems weird to me to create a thread specifically for processing the external resource and then, within that thread, create a BackgroundWorker to actually do that job.
You could use a CancellationToken. Most async functions accept one as a parameter, and they cancel the call (the returned Task, actually) if the token is signaled. SqlCommand.ExecuteReaderAsync (which you're likely using to issue the WAITFOR RECEIVE) is no different. So:
- Have a cancellation token passed to the 'execution' thread.
- The settings monitor (the one responding to IPC) also has a reference to the token.
- When a config change occurs, the monitor makes the config change and then signals the token.
- The execution thread aborts any pending WAITFOR (or any pending processing in the message-processing loop, actually; you should use the cancellation token everywhere), and any transaction is aborted and rolled back.
- Restart the execution thread with a new cancellation token; it will use the new config. (A sketch follows.)
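A rough sketch of that shape (the class, SQL and connection handling are illustrative, not from your code; note that SqlClient sometimes surfaces cancellation as a SqlException rather than an OperationCanceledException, so you may need to catch both):

using System;
using System.Data.SqlClient;
using System.Threading;
using System.Threading.Tasks;

class BrokerWorker
{
    private volatile bool serviceStopped;
    private CancellationTokenSource cts = new CancellationTokenSource();

    public async Task RunAsync(string connectionString)
    {
        while (!serviceStopped)
        {
            CancellationToken token = cts.Token;
            try
            {
                using (var conn = new SqlConnection(connectionString))
                {
                    await conn.OpenAsync(token);
                    // The receive timeout keeps the statement from blocking forever
                    using (var cmd = new SqlCommand("WAITFOR (RECEIVE ...), TIMEOUT 30000;", conn))
                    using (var reader = await cmd.ExecuteReaderAsync(token))
                    {
                        // ... process broker messages with the current config ...
                    }
                }
            }
            catch (OperationCanceledException)
            {
                // Token was signaled: reload config here, then loop with a fresh token
                cts = new CancellationTokenSource();
            }
        }
    }

    // Called from the main service thread after it applies a config change
    public void SignalConfigChange() => cts.Cancel();
}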
So in this particular case, I decided to go with a simpler shared-state solution. This is of course a less sound solution in principle, but since there's not a lot of shared state involved, and since the overall application isn't very complicated, it seemed forgivable.
My implementation is to use locking, but with writes to the config from the service's main thread wrapped in a Task.Run(). The reader doesn't bother with a Task, since the reader is already on its own thread.
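Something like this minimal shape (illustrative; WorkerConfig stands in for whatever the config type actually is):

using System.Threading.Tasks;

class ConfigHolder
{
    private readonly object configLock = new object();
    private WorkerConfig config = new WorkerConfig();

    // Main service thread: apply the change without blocking the IPC handler
    public void UpdateConfig(WorkerConfig newConfig)
    {
        Task.Run(() => { lock (configLock) { config = newConfig; } });
    }

    // Worker thread: take a snapshot at the top of each loop iteration
    public WorkerConfig ReadConfig()
    {
        lock (configLock) { return config; }
    }
}

class WorkerConfig { /* ... settings ... */ }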

How to specify a timeout value on HttpWebRequest.BeginGetResponse without blocking the thread

I'm trying to issue web requests asynchronously. I have my code working fine except for one thing: there doesn't seem to be a built-in way to specify a timeout on BeginGetResponse. The MSDN examples clearly show a working approach, but the downside is that they all end up with a

SomeObject.WaitOne()

which, again, clearly blocks the thread. I will be in a high-load environment and can't have blocking, but I also need to time out a request if it takes more than 2 seconds. Short of creating and managing a separate thread pool, is there something already present in the framework that can help me?
Starting examples:
http://msdn.microsoft.com/en-us/library/ms227433(VS.100).aspx
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.begingetresponse.aspx
What I would like is a way for the async callback on BeginGetResponse() to be invoked after my timeout parameter expires, with some indication that a timeout occurred.
The seemingly obvious Timeout property is not honored on async calls.
The ReadWriteTimeout property doesn't come into play until the response returns.
A non-proprietary solution would be preferable.
EDIT:
Here's what I came up with: after calling BeginGetResponse, I create a Timer with my duration, and that's the end of the "begin" phase of processing. Now either the request will complete and my "end" phase will be called, or the timeout period will expire.
To detect the race and have a single winner, I increment a "completed" counter in a thread-safe manner. If "timeout" is the first event to come back, I abort the request and stop the timer. In this situation, when "end" is called, EndGetResponse throws an error. If the "end" phase happens first, it increments the counter and the "timeout" phase foregoes aborting the request.
This seems to work like I want while also providing a configurable timeout. The downside is the extra timer object and the callbacks, which I make no effort to avoid. I see 1-3 threads processing the various portions (begin, timed out, end), so it seems to be working. And I don't have any "wait" calls.
Have I missed too much sleep or have I found a way to service my requests without blocking?
int completed = 0;
this.Request.BeginGetResponse(GotResponse, this.Request);
this.timer = new Timer(Timedout, this, TimeOutDuration, Timeout.Infinite);

private void Timedout(object state)
{
    // Timer won the race: abort the request so the pending EndGetResponse throws
    if (Interlocked.Increment(ref completed) == 1)
    {
        this.Request.Abort();
    }
    this.timer.Change(Timeout.Infinite, Timeout.Infinite);
    this.timer.Dispose();
}

private void GotResponse(IAsyncResult result)
{
    // Response won the race: bump the counter so Timedout skips the abort
    Interlocked.Increment(ref completed);
}
You can use a BackgroundWorker to run your HttpWebRequest on a separate thread, so your main thread stays alive. The background thread will be blocked, but the main one won't.
In this context, you can use a ManualResetEvent.WaitOne() just like in this sample: HttpWebRequest.BeginGetResponse() method.
What kind of an application is this? Is it a service process, web application, or console app?
How are you creating your workload (i.e. the requests)? If you have a queue of work that needs to be done, you can start off 'N' async requests (with the framework for timeouts that you have built) and then, once each request completes (either with a timeout or success), grab the next request from the queue.
This thus becomes a producer/consumer pattern.
So, if you configure your application to have a maximum of 'N' requests outstanding, you can maintain a pool of 'N' timers that you reuse (without disposing) between requests.
Or, alternately, you can use ThreadPool.SetTimerQueueTimer() to manage your timers; the thread pool will manage the timers for you and reuse them between requests.
Hope this helps.
Seems like my original approach is the best thing available.
If you can use async/await, then:
private async Task<WebResponse> getResponseAsync(HttpWebRequest request)
{
    var responseTask = Task.Factory.FromAsync(request.BeginGetResponse,
        ar => (HttpWebResponse)request.EndGetResponse(ar), null);
    var winner = await Task.WhenAny(responseTask, Task.Delay(new TimeSpan(0, 0, 20)));
    if (winner != responseTask)
    {
        throw new TimeoutException();
    }
    return await responseTask;
}
