How do I determine, at the earliest, if an AWS step function state has failed?

We are implementing a workflow on AWS Step Functions (a state machine) that updates user records and rolls back if something goes wrong. The state machine does its processing in two parts:
Part 1 - updating
Part 2 - rollback
When the state machine takes the rollback path, the process takes very long, and it is unacceptable to make the client wait that long. Just before the rollback starts, however, the client could be informed. I am trying to figure out a way to achieve this.
I have already tried using describeExecution(), but the status only changes to FAILED after the state machine has finished executing, which is again too late.
I also tried inserting an "SQS send message" step at the point (between part 1 and part 2) where failure is likely, and then polling that queue from the orchestration function (the handler of my API endpoint). However, this is not going to work, as I may have hundreds of requests running in parallel and this polling scheme will eventually break down.
Appreciate an early response.
Cheers.

First, I'd recommend reading up on error handling in Step Functions: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
You can use fallback states (Task, Map, Parallel) and catch errors by adding a Catch field, like so:
"Catch": [ {
"ErrorEquals": [ "java.lang.Exception" ],
"ResultPath": "$.error-info",
"Next": "RecoveryState"
}, {
"ErrorEquals": [ "States.ALL" ],
"Next": "EndState"
} ]
If you intend to use the API to get the current state of the execution, you can use GetExecutionHistory. It returns a list of events, and you can check that array for failure events, e.g. taskFailedEventDetails.
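As a minimal sketch (assuming the aws-sdk v2 client and an execution ARN you already hold; the function name is made up for illustration), polling for the first task failure could look like:

import { StepFunctions } from "aws-sdk"; // aws-sdk v2

const stepFunctions = new StepFunctions({ region: "eu-west-2" });

// Scan the execution history (newest first) for a TaskFailed event;
// its error and cause live in event.taskFailedEventDetails.
async function findFirstTaskFailure(executionArn) {
  const history = await stepFunctions
    .getExecutionHistory({ executionArn, reverseOrder: true })
    .promise();
  return history.events.find((event) => event.type === "TaskFailed");
}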

Related

Bull queue - ability to mark job as "failed" (depending on result) and then optionally retry manually

I have a queue system that uses Bull (https://optimalbits.github.io/bull/), which receives API requests and then dispatches them to an ERP system consecutively, one at a time, as each request completes (e.g. to avoid crashing the ERP system when a user generates 20 API requests at once).
However, sometimes the ERP system fails to process the API commands for a variety of reasons. Currently this is treated as "completed" in the Bull queue, with a custom status of failed.
However, we'd like to be notified of such failures, and to be able to manually retry the failed API command.
The documentation (https://github.com/OptimalBits/bull/blob/master/REFERENCE.md#queueprocess) seems to indicate that the done callback can be called with an Error instance, but doesn't give an example:
The callback is called every time a job is placed in the queue. It is passed an instance of the job as first argument.
If the callback signature contains the second optional done argument, the callback will be passed a done callback to be called after the job has been completed. The done callback can be called with an Error instance, to signal that the job did not complete successfully, or with a result as second argument (e.g.: done(null, result);) when the job is successful. Errors will be passed as a second argument to the "failed" event; results, as a second argument to the "completed" event.
If, however, the callback signature does not contain the done argument, a promise must be returned to signal job completion. If the promise is rejected, the error will be passed as a second argument to the "failed" event. If it is resolved, its value will be the "completed" event's second argument.
[emphasis added]
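To make the documented behavior concrete, here is a minimal sketch (the queue name and error text are made up; the done callback and the "failed" event are from the docs quoted above):

const Queue = require('bull');
const queue = new Queue('erp-commands'); // hypothetical queue name

queue.process((job, done) => {
  if (job.data.shouldFail) {
    done(new Error('ERP rejected the command')); // job ends up in the 'failed' state
  } else {
    done(null, { ok: true }); // job ends up in the 'completed' state
  }
});

queue.on('failed', (job, err) => {
  console.log(`job ${job.id} failed:`, err.message);
});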
Here's a sample of the code I'm using where the queue is being processed:
queue.process(async (job, done) => {
  switch (job.data.type) {
    case 'API_Type': {
      const workOrder = await createWorkOrder(job.data)
      socketService.emiter('API_Type', workOrder, job.data.socketId)
      workOrder.status === 'success' ? done(null, workOrder) : placeJobInFailedStatus
      break
    }
    //...
    default:
      done()
  }
})
Where the pseudocode says "placeJobInFailedStatus": how can I instead make just that job "fail"/"stall"/"pause" in the Bull queue, while continuing with the others (instead of marking it completed)?
And is there any way to manually retry a "failed" queue entry, or does it need to be re-added to the queue fresh? (I don't want to just try it again in a few seconds; it may need some manual user input to adjust something first.) I'm not seeing a way to manually retry in the documentation. (There is https://github.com/OptimalBits/bull/blob/master/REFERENCE.md#jobretry, but I'm not following how it's invoked. And there is https://github.com/OptimalBits/bull/blob/master/REFERENCE.md#jobpromote, but I'm also not seeing how to put a task into a "delayed" status.)
Before I try designing a separate method to handle this, I'm wondering if some of it is supported natively, since it seems like a common need for a queue system.
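Regarding the manual retry, the linked reference does expose Job#retry() for jobs in the failed state. A sketch of how it might be wired up (queue.getFailed() and job.retry() are from the Bull reference; the surrounding flow is an assumption):

// After an operator has fixed the underlying data, re-run everything
// that previously failed; retry() moves a failed job back to 'waiting'.
async function retryAllFailed(queue) {
  const failedJobs = await queue.getFailed();
  for (const job of failedJobs) {
    await job.retry();
  }
}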

Update (skip) wait duration on AWS Step Function

I have a Step Function with a 'wait' state (e.g. 999999 seconds). Once the wait is over, the Step Function invokes a Lambda. Sometimes I will want to interrupt the wait and trigger the Lambda immediately. Is this possible?
I thought I could do it by using the aws-sdk Step Functions API to manually skip the wait, but I've been experimenting with no success.
I tried the API's StartExecution method, but that only starts the entire Step Function (https://docs.aws.amazon.com/step-functions/latest/apireference/API_StartExecution.html); I can't find anything for manipulating individual steps.
I can use GetExecutionHistory to return an event object that describes the Wait step, e.g.:
{
  timestamp: 2022-10-17T08:38:27.849Z,
  type: 'WaitStateEntered',
  id: 2,
  previousEventId: 0,
  stateEnteredEventDetails: {
    name: 'Wait',
    input: '{\n "Comment": "Insert your JSON here"\n}',
    inputDetails: { truncated: false }
  }
},
But there doesn't seem to be a way to manipulate this event to move to the next step.
I've spoken to AWS tech support, who have confirmed that there is nothing in the aws-sdk or the aws-cdk that allows updating an existing state (e.g. a 'wait' state) while it is running. There are some workarounds:
1. AWS tech support suggest iterating a loop using a Lambda. This loops over Choice > Wait > Lambda > (repeat), where the Lambda returns an output that tells the Choice whether to continue the loop or direct the execution to another state. The advantage is that we don't need to cancel the execution, and we keep a simpler record of activities. The disadvantage is that we are regularly invoking a Lambda.
2. As per @Guy's suggestion, we could split the Step Function into two separate Step Functions. This means we could cancel the initial Step Function and then trigger the latter Step Function manually.
We can cancel the execution of a Step Function with stopExecution. For example, using the aws-sdk:
import { config, StepFunctions } from "aws-sdk"; // package.json: "aws-sdk": "^2.1232.0"

config.update({ region: "eu-west-2" });
const stepFunctions = new StepFunctions();

const stoppedExecution = await stepFunctions
  .stopExecution({
    executionArn: "...",
    cause: "...",
    error: "...",
  })
  .promise();
We can then trigger a new Step Function with startExecution:
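A matching sketch in the same style (the ARN and input are placeholders):

const startedExecution = await stepFunctions
  .startExecution({
    stateMachineArn: "...",
    input: JSON.stringify({ Comment: "Insert your JSON here" }),
  })
  .promise();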
3. Step Functions also allow us to wait for a callback with a task token. Basically, the execution step sends a task token (e.g. to a Lambda), then waits for the task token to be returned. Once received, the execution proceeds to the next step.
There are two ways I can think of proceeding from item 3 above:
a. Configure a heartbeat timeout for the waiting task. If the heartbeat timeout ends without a response token being received, the task fails with a States.Timeout error name. We can (I assume) handle the error in the task rule to trigger the next step anyway. The default behaviour is then to trigger the next step after the duration elapses, and we also have the facility to skip the wait by sending the task token back to the execution, as in the sketch below.
b. Use another service to perform the wait and return the task token after the wait duration has elapsed.
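For option a., the "skip the wait" half could be a sketch like this (assuming you stored the task token somewhere retrievable; sendTaskSuccess is the aws-sdk call that releases a task waiting for a callback):

// Return the stored task token to release the waiting task early.
await stepFunctions
  .sendTaskSuccess({
    taskToken: storedTaskToken, // hypothetical: fetched from wherever it was saved
    output: JSON.stringify({ skippedWait: true }),
  })
  .promise();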
Option 3 of your answer would still require some service/process to handle or poll whether or not to continue, I believe.
I've implemented a pattern very similar to your description, but it can be defined in a single Step Function definition.
Note: I consider this a hack/abuse of the Amazon States Language, but it has the benefit of keeping a single state machine definition/execution, and it avoids paying for the excessive state transitions of the looping method:
1. Put your Wait state inside a new Parallel state.
2. Add a waitForCallback-type task (DynamoDB, SQS, etc.) in the Parallel state, making sure to configure its timeout/heartbeat to the same duration as the "neighboring" Wait state.
3. If/when you want to "short circuit" the wait, query wherever you stored the task token and send a SendTaskFailure with a unique cause/error payload (see the sketch after this answer).
4. Configure the Catch (fallback state) for the Parallel state to point to your "Invoke Lambda" state.
5. Also configure the default Next field for the Parallel block to point to your "Invoke Lambda" state.
This may not be very intuitive, but it relies on the fact that if any state defined in a Parallel state fails, the entire block fails immediately. With some custom error handling, though, you can "ignore" the special sentinel error, thus short-circuiting the long wait and proceeding to your next state.
It's definitely not perfect, and you'll have to play around with errors/timeouts/heartbeats that make sense for your use case.
Depending on how many executions/transitions you expect, the easiest thing I've found is making sure the task token ends up in a predictable CloudWatch log group, then querying it with CloudWatch Logs Insights when I need to retrieve it again.
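A sketch of the "short circuit" call from step 3 (the sentinel error name is arbitrary; it just has to match what the Parallel state's Catch looks for):

// Deliberately fail the callback task; the Parallel block fails immediately,
// and the Catch routes the "SkipWait" sentinel to the "Invoke Lambda" state.
await stepFunctions
  .sendTaskFailure({
    taskToken: storedTaskToken, // hypothetical: retrieved from CloudWatch Logs, DynamoDB, etc.
    error: "SkipWait", // made-up sentinel error name
    cause: "User requested early continuation",
  })
  .promise();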

How to handle multiple post requests at the same time while saving one of them on the db?

I am getting n POST requests (on each webhook trigger) from a webhook. The data is identical across all requests that come from the same trigger - they all have the same orderId. I'm interested in saving only one of these requests, so on each endpoint hit I check whether this specific orderId already exists as a row in my database, and create it otherwise.
// (inside the request handler, after an API-key check whose opening `if` is elided here)
try {
  // orderIdExists: presumably the promise of something like Order.findOne({ where: { orderId } })
  if (await orderIdExists === null) {
    await Order.create({
      userId,
      status: PENDING,
      price,
      // ...
    });
    await sleep(3000); // attempted workaround, see p.s. below
  }
  return res.status(HttpStatus.OK).send({ success: true });
} catch (error) {
  return res.status(HttpStatus.INTERNAL_SERVER_ERROR).send({ success: false });
}
// ... the branch taken when the API key is invalid:
else {
  return res.status(HttpStatus.UNAUTHORIZED).send(responseBuilder(false, responseErrorCodes.INVALID_API_KEY, {}, req.t));
}

function sleep(ms) {
  return new Promise((resolve) => {
    setTimeout(resolve, ms);
  });
}
The problem is that before Sequelize manages to save the newly created order in the DB, I already get another endpoint hit from the other n POST requests (all n requests reach the endpoint within a second or less), while orderIdExists still equals null, so it ends up creating more identical orders. One (not so good) solution is to make orderId unique in the DB, which prevents the creation of an order with the same orderId, but the insert is still attempted anyway, which results in gaps in the auto-incremented id column. Any idea would be greatly appreciated.
P.S. As you can see, I tried adding a 'sleep' function, to no avail.
Your database is failing to complete its save operation before the next request arrives. The problem is similar to the Dogpile Effect or a "cache slam".
This requires some more thinking about how you are framing the problem: in other words, the "solution" will be more philosophical and perhaps have less to do with code, so your results on StackOverflow may vary.
The "sleep" solution is no solution at all: there's no guarantee how long the database operation might take, or how long you might wait before another duplicate request arrives. As a rule of thumb, any time "sleep" is deployed as a "solution" to a concurrency problem, it is usually the wrong choice.
Let me posit two possible ways of dealing with this:
Option 1: write-only. That is, don't try to "solve" this by reading from the database before you write to it. Keep the pipeline leading into the database as dumb as possible and just keep writing. E.g. consider a "logging" table that stores whatever the webhook throws at it; don't try to read from it, just keep inserting (or upserting). If you get 100 ping-backs about a specific order, so be it: your table logs them all, and if you end up with 100 rows for a single orderId, let some downstream process worry about the duplicated data. Presumably Sequelize is smart enough (and your database supports whatever locking is needed) to queue up the operations and deal with repeated writes.
An upsert operation would be helpful here if you do want a unique constraint on orderId (this seems sensible, but you may be aware of other considerations in your particular setup); see the sketch below.
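A minimal sketch of the upsert variant with Sequelize (assuming a unique constraint on orderId; Order.upsert inserts the row or updates the existing one):

// With orderId unique, concurrent webhook hits can never create duplicates:
// whichever request lands second simply updates the row the first one created.
await Order.upsert({
  orderId,
  userId,
  status: PENDING,
  price,
});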
Option 2: use a queue. This is decidedly more complex, so weigh carefully whether or not your use case justifies the extra work. Instead of writing data immediately to the database, throw the webhook data into a queue (e.g. a first-in-first-out FIFO queue). Ideally you would choose a queue that supports de-duplication, so that the messages leaving the queue are guaranteed to be unique; but that implies state, and state usually relies on a database of some sort, which is the problem we started with.
The most important thing a queue does for you is serialize the messages so you can deal with them one at a time (instead of multiple database operations kicking off concurrently). You can upsert data into the database as you read each message out of the queue. If the webhook keeps firing and more messages enter the queue, that's fine, because the queue forces them all to line up single file and you can handle each insertion one at a time. You'll know that each database operation has completed before you move on to the next message, so you never "slam" the DB. In other words, putting a queue in front of the database lets you handle data when the database is ready, instead of whenever a webhook comes calling.
The idea of a queue here is similar to what a semaphore accomplishes. Note that your database interface may already implement a kind of queue/pool under the hood, so weigh this option carefully: don't reinvent the wheel.
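In its simplest in-process form, "putting a queue in front of the database" can be just a promise chain that forces the writes to run one at a time. A sketch (not production-grade: it lives in one process and won't help across multiple server instances):

// All order writes go through one chain, so the existence check and the
// insert can never interleave between two concurrent requests.
let writeChain = Promise.resolve();

function enqueueWrite(task) {
  writeChain = writeChain.then(task, task); // keep the chain alive after errors
  return writeChain;
}

// In the webhook handler:
await enqueueWrite(async () => {
  const existing = await Order.findOne({ where: { orderId } });
  if (existing === null) {
    await Order.create({ orderId, userId, status: PENDING, price });
  }
});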
Hope those ideas are useful.
You saved my time @Everett and @april-henig. I found that saving directly into the database led to duplicate records. Storing the requests in an object and dealing with one record at a time helped me a lot.
Maybe I should share my solution, as some may find it useful in future.
Create an empty object to store successful requests:
export const queueAllSuccessCallBack = {};
Save the POST request in the object:
if (status === 'success') { // only handle successful requests
  const findKeyTransaction = queueAllSuccessCallBack[client_reference_id];
  if (!findKeyTransaction) { // skip IDs that were already added, to avoid duplicates
    queueAllSuccessCallBack[client_reference_id] = {
      transFound,
      body,
    }; // save the new request id as the key, and whatever data you want as the value
  }
}
Access the object to save into the database:
const keys = Object.keys(queueAllSuccessCallBack);
keys.forEach(async (key) => {
  // ...
  // Do extra checks if you want to do so
  // Or save to the database directly
});

Thread with a forever loop with one inherently asynchronous operation

I'm trying to understand the semantics of async/await in an infinitely looping worker thread started inside a Windows service. I'm a newbie at this, so give me some leeway here; I'm trying to understand the concept.
The worker thread will loop forever (until the service is stopped), and it processes an external queue resource (in this case a SQL Server Service Broker queue).
The worker thread uses config data which could be changed while the service is running, by receiving commands on the main service thread via some kind of IPC. Ideally, the worker thread should process those config changes while waiting for the external queue messages to be received. Reading from Service Broker is inherently asynchronous: you literally issue a WAITFOR (RECEIVE ...) T-SQL statement with a receive timeout.
But I don't quite understand the flow of control I'd need to use to do that.
Let's say I used a ConcurrentQueue to pass config change messages from the main thread to the worker thread. Then, if I did something like...
void ProcessBrokerMessages() {
    foreach (BrokerMessage m in ReadBrokerQueue()) {
        ProcessMessage(m);
    }
}

// ... inside the worker thread:
while (!serviceStopped) {
    foreach (var configChange in configChangeConcurrentQueue) {
        processConfigChange(configChange);
    }
    ProcessBrokerMessages();
}
...then the foreach loop that processes config changes and the broker-processing function have to "take turns" running. Specifically, the config-change-processing loop won't run while the potentially long-running broker receive command is in progress.
My understanding is that simply turning ProcessBrokerMessages() into an async method doesn't help me here (or I don't understand what would happen). To me, with my limited understanding, the most intuitive interpretation is that when I hit the async call, it would go off and do its thing, and execution would continue by restarting the outer while loop; but that would mean the loop would also execute ProcessBrokerMessages() over and over, even though it's already running from the invocation in the previous iteration, which I don't want.
As far as I know that's not what would happen, though I only "know" that because I've read something along those lines. I don't really understand it.
Arguably the existing flow of control (i.e., without the async call) is OK: if config changes affect the ProcessBrokerMessages() function (which they can), then the config can't be changed while the function is running anyway. But that point is specific to this particular example. I can imagine a case where config changes affect something else the thread does, unrelated to the ProcessBrokerMessages() call.
Can someone improve my understanding here? What's the right way to have
a block of code which loops over multiple statements
where one (or some) but not all of those statements are asynchronous
and the async operation should only ever be executing once at a time
but execution should keep looping through the rest of the statements while the single instance of the async operation runs
and the async method should be called again in the loop if the previous invocation has completed
It seems like I could use a BackgroundWorker to run the receive statement and flip a flag when its job is done, but it also seems weird to create a thread specifically for processing the external resource and then, within that thread, create a BackgroundWorker to actually do the job.
You could use a CancellationToken. Most async functions accept one as a parameter, and they cancel the call (the returned Task, actually) if the token is signaled. SqlCommand.ExecuteReaderAsync (which you're likely using to issue the WAITFOR RECEIVE) is no different. So:
Have a cancellation token passed to the "execution" thread.
The settings monitor (the one responding to IPC) also has a reference to the token.
When a config change occurs, the monitor applies the config change and then signals the token.
The execution thread aborts any pending WAITFOR (or any pending processing in the message-processing loop, actually; you should use the cancellation token everywhere). Any transaction is aborted and rolled back.
Restart the execution thread with a new cancellation token. It will use the new config.
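A minimal sketch of that shape (names like RunWorkerLoopAsync, RegisterWithConfigMonitor, connectionString, and the queue name are made up; the key point is passing the same token into ExecuteReaderAsync so a config change cancels the pending WAITFOR):

using System;
using System.Data.SqlClient;
using System.Threading;
using System.Threading.Tasks;

async Task RunWorkerLoopAsync()
{
    while (!serviceStopped)
    {
        // A fresh token per iteration; the IPC/config monitor holds the source
        // and calls Cancel() when the config changes.
        using var cts = new CancellationTokenSource();
        RegisterWithConfigMonitor(cts); // hypothetical: lets the monitor signal us

        try
        {
            using var conn = new SqlConnection(connectionString);
            await conn.OpenAsync(cts.Token);
            using var cmd = new SqlCommand(
                "WAITFOR (RECEIVE TOP (1) * FROM MyQueue), TIMEOUT 30000;", conn);
            using var reader = await cmd.ExecuteReaderAsync(cts.Token);
            // ... process received messages, checking cts.Token regularly ...
        }
        catch (OperationCanceledException)
        {
            // Config changed: drop through, re-read config, loop with a new token.
            // (Depending on the provider, cancellation may also surface as a SqlException.)
        }
    }
}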
So in this particular case I decided to go with a simpler shared-state solution. This is of course a less sound solution in principle, but since there's not a lot of shared state involved, and since the overall application isn't very complicated, it seemed forgivable.
My implementation uses locking, but writes to the config from the service's main thread are wrapped in a Task.Run(). The reader doesn't bother with a Task, since the reader is already on its own thread.

Concurrency between Meteor.setTimeout and Meteor.methods

In my Meteor application, to implement a turn-based multiplayer game server, the clients receive the game state via publish/subscribe and can call a Meteor method sendTurn to send turn data to the server (they cannot update the game-state collection directly).
var endRound = function(gameRound) {
  // check if gameRound has already ended /
  // if round results have already been determined
  // --> yes: do nothing
  // --> no:
  //   determine round results
  //   update collection
  //   create next gameRound
};
Meteor.methods({
  sendTurn: function(turnParams) {
    // find gameRound data
    // validate turnParams against gameRound
    // store turn (update "gameRound" collection object)
    // have all clients sent in turns for this round?
    //   yes --> call "endRound"
    //   no --> wait for other clients to send turns
  }
});
To implement a time limit, I want to wait for a certain time period (to give clients time to call sendTurn), and then determine the round result - but only if the round result has not already been determined in sendTurn.
How should I implement this time limit on the server?
My naive approach to implement this would be to call Meteor.setTimeout(endRound, <roundTimeLimit>).
Questions:
What about concurrency? I assume I should update collections synchronously (without callbacks) in sendTurn and endRound, but would this be enough to eliminate race conditions? (Reading the 4th comment on the accepted answer to this SO question, about synchronous database operations also yielding, I doubt it.)
In that regard, what does "per request" mean in the Meteor docs in my context (the function endRound being called by a client method call and/or in a server setTimeout)?
In Meteor, your server code runs in a single thread per request, not in the asynchronous callback style typical of Node.
In a multi-server / clustered environment, (how) would this work?
Great question, and it's trickier than it looks. First off I'd like to point out that I've implemented a solution to this exact problem in the following repos:
https://github.com/ldworkin/meteor-prisoners-dilemma
https://github.com/HarvardEconCS/turkserver-meteor
To summarize, the problem basically has the following properties:
Each client sends in some action on each round (you call this sendTurn)
When all clients have sent in their actions, run endRound
Each round has a timer that, if it expires, automatically runs endRound anyway
endRound must execute exactly once per round regardless of what clients do
Now, consider the properties of Meteor that we have to deal with:
Each client can have exactly one outstanding method to the server at a time (unless this.unblock() is called inside a method). Following methods wait for the first.
All timeout and database operations on the server can yield to other fibers
This means that whenever a method call goes through a yielding operation, values in Node or the database can change. This can lead to the following potential race conditions (these are just the ones I've fixed, but there may be others):
In a 2-player game, for example, two clients call sendTurn at exactly the same time. Both perform a yielding operation to store the turn data. Both methods then check whether 2 players have sent in their turns, find the affirmative, and endRound gets run twice.
A player calls sendTurn right as the round times out. In that case, endRound is called by both the timeout and the player's method, again resulting in it running twice.
Incorrect fixes to the above problems can result in starvation, where endRound never gets called.
You can approach this problem in several ways, either synchronizing in Node or in the database.
Since only one Fiber can actually change values in Node at a time, if you don't call a yielding operation you are guaranteed to avoid race conditions. So you can cache things like the turn states in memory instead of in the database. However, this requires that the caching is done correctly, and it doesn't carry over to clustered environments.
Move the endRound code outside of the method call itself, using something else to trigger it. This is the approach I've taken, which ensures that only the timer or the final player triggers the end of the round, not both (see here for an implementation using observeChanges).
In a clustered environment you will have to synchronize using only the database, probably with conditional update operations and atomic operators. Something like the following:
var currentVal;
while (true) {
  currentVal = Foo.findOne(id).val; // yields
  if (Foo.update({_id: id, val: currentVal}, {$inc: {val: 1}}) > 0) {
    // Operation went as expected
    // (your code here, e.g. endRound)
    break;
  } else {
    // Race condition detected, try again
  }
}
The above approach is primitive and probably results in bad database performance under high load; it also doesn't handle timers, but I'm sure with some thought you can extend it to work better.
You may also want to see this timers code for some other ideas. I'm going to extend it to the full setting that you described once I have some time.
