I'm trying out AWS Step Functions. This is what I'm trying to create:
Get a list of endpoints from DynamoDB (https://user:password#server1.com, https://user2:password#server2.com, etc.)
From each domain, I get a list of ids via /all.
For each id in the result, I want to do a series of REST calls, e.g. https://user:password#server1.com/device/{id} (only one request at a time to each domain, with the domains running in parallel).
Save the result in DynamoDB and check whether it is a duplicate result or not.
I know how to make the REST calls and save to DynamoDB, etc.
But the problem I'm unable to find the answer to is:
How can I run /all in parallel for each domain in the array I get from DynamoDB?
AWS Step Functions state machines are immutable; once created, they cannot be changed. Given this, you cannot have a dynamic number of branches in your Parallel state.
To solve for this, you'll probably want to approach your design a little differently. Instead of solving this with a single Step Function, consider breaking it apart into two different state machines, as shown below.
Step Function #1: Retrieve List of Endpoints
Start
Task: Retrieves list of endpoints from DynamoDB
Task: For each endpoint, invoke Step Function #2 and pass in endpoint
End
You could optionally combine the second and third states (the two Tasks) to simplify the state machine and your task code.
Step Function #2: Perform REST API Calls
Start: takes a single endpoint as its input
Task: Perform series of REST calls against the endpoint
Task: Store the results in DynamoDB
End
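As a rough sketch of how the fan-out task in Step Function #1 might look, assuming it is implemented as a Python Lambda using boto3, with a hypothetical Endpoints table and a hypothetical ARN for Step Function #2:

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
stepfunctions = boto3.client("stepfunctions")

# Placeholder names - substitute your actual table name and state machine ARN.
ENDPOINTS_TABLE = "Endpoints"
WORKER_STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:PerformRestCalls"

def handler(event, context):
    # Retrieve the list of endpoints from DynamoDB (assumes a "url" attribute per item).
    endpoints = dynamodb.Table(ENDPOINTS_TABLE).scan()["Items"]

    # Start one execution of Step Function #2 per endpoint. Each execution works
    # through its own domain sequentially, so the domains run in parallel while
    # each domain only ever sees one request at a time.
    for endpoint in endpoints:
        stepfunctions.start_execution(
            stateMachineArn=WORKER_STATE_MACHINE_ARN,
            input=json.dumps({"endpoint": endpoint["url"]}),
        )

    return {"started": len(endpoints)}
```

This sketch also folds the two Task states of Step Function #1 into a single Lambda, which is the simplification mentioned above.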
I have a DLT pipeline that creates a Delta table by reading from SQL Server and then calls a few APIs to update metadata in our Cosmos DB. Whenever we start it, it gets stuck in the initialising state.
But when we run the same code on an interactive cluster in a standalone notebook, it works fine.
Can someone help me understand this issue?
The DLT pipeline shouldn't get stuck in the initialising state.
The problem is that you've structured your DLT program incorrectly. Programs written for DLT should be declarative by design, but in your case you're performing your actions at the top level, not inside the functions marked with @dlt.table. When a DLT pipeline starts, it builds the execution graph by evaluating all of the code and identifying the vertices of the execution graph that are marked with @dlt annotations (you can see that your function is called several times, as explained here). And because your code has the side effects of reading all the data with spark.read.jdbc, interacting with Cosmos DB, etc., the initialisation step is really slow.
To illustrate the problem, let's look at your code structure. Right now you have the following:
def read(...):
  1. Perform read via `spark.read.jdbc` into `df`
  2. Perform operations with Cosmos DB
  3. Return annotated function that will just return captured `df`
As a result, items 1 & 2 are performed during the initialisation stage, not when the actual pipeline is executed.
To mitigate this problem, you need to change the structure to the following:
def read(...):
  1. Return annotated function that will:
     1. Perform read via `spark.read.jdbc` into `df`
     2. Perform operations with Cosmos DB
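A rough Python sketch of that corrected structure (the update_cosmos_metadata helper and the table_name/jdbc_url parameters are placeholders, not part of the original code; spark is provided by the DLT runtime):

```python
import dlt

def read(table_name, jdbc_url):
    # All heavy work is deferred into the @dlt.table-annotated function, so nothing
    # runs while DLT evaluates this code to build the execution graph.
    @dlt.table(name=table_name)
    def load():
        df = spark.read.jdbc(url=jdbc_url, table=table_name)  # 1. read happens at pipeline run time
        update_cosmos_metadata(table_name)                     # 2. placeholder for your Cosmos DB calls
        return df
    return load
```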
Is there any way we can pull more than 10 records from the Workwave POD API?
Whenever I call the Workwave API through a Map/Reduce script, it gives me an error message telling me to slow down. Has anyone run into this, and how did they manage to work around it?
Thanks
If you're using the List Orders or Get Orders API, there is a throttling limit - "Leaky bucket (size: 10, refill: 1 per minute)". However, both of those APIs allow for retrieving multiple orders in a single call. My suggestion would be to restructure your script so that instead of making the call to Workwave in the Reduce stage for a single order, you make it in the Get Input Data stage for all orders you want to operate on, and map the relevant data to the corresponding NetSuite data in the Map stage before passing it through to the Reduce stage.
In other words, you make one call listing multiple order ids rather than multiple calls listing one order id.
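Purely to illustrate the batching idea (this is not SuiteScript, and the host, path, parameter, and header names below are made up; check the Workwave API documentation for the real List Orders request shape):

```python
import requests

# Hypothetical base URL, used only to show the shape of the batched call.
BASE_URL = "https://api.example-workwave-host/v1"

def fetch_orders_batched(api_key, order_ids):
    # One request that lists many order ids, instead of one request per order,
    # which stays well under a leaky-bucket limit of size 10.
    response = requests.get(
        f"{BASE_URL}/orders",
        params={"ids": ",".join(order_ids)},
        headers={"X-Api-Key": api_key},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```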
I have 100 rows in Azure Table Storage, but later I can add more rows or set a "disabled" property on any row in the table.
I have an Azure function - "XProcessor". I would like to have an Azure function "HostFunction" which would start a new instance of the "XProcessor" for each row from the Azure table storage.
The "HostFunction" should be able to pass details of a table row to the instance of the "XProcessor" and the "HostFunction" needs to be executed every minute.
How do I achieve this? I am looking into Azure Logic Apps but am not sure yet how to orchestrate "XProcessor" with the details.
I would look into using a combination of "Durable Functions" techniques.
Eternal orchestration - allows your process to run, wait a set period of time after completion, and then run again.
From the docs: Eternal orchestrations are orchestrator functions that never end. They are useful when you want to use Durable Functions for aggregators and any scenario that requires an infinite loop.
Fan-out/fan-in - allows you to call a separate function per row.
From the docs: Fan-out/fan-in refers to the pattern of executing multiple functions in parallel, and then waiting for all to finish. Often some aggregation work is done on results returned from the functions.
There is a bit of additional overhead getting going with Durable Functions, but it gives you fine-grained control over your execution. Keep in mind that the state of objects is serialised in Durable Functions at every await call, so thousands of rows could potentially be an issue, but for the scenario you describe it will work well and I have had a lot of success with it.
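A rough Python sketch combining the two patterns, assuming an activity named "GetTableRows" that reads the table rows and an activity named "XProcessor" wrapping your existing function (both names are placeholders for whatever you actually deploy):

```python
from datetime import timedelta
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    # Fan-out/fan-in: one activity call per enabled table row, then wait for all of them.
    rows = yield context.call_activity("GetTableRows", None)
    tasks = [context.call_activity("XProcessor", row)
             for row in rows if not row.get("disabled")]
    yield context.task_all(tasks)

    # Eternal orchestration: wait roughly a minute, then restart with a fresh history.
    next_run = context.current_utc_datetime + timedelta(minutes=1)
    yield context.create_timer(next_run)
    context.continue_as_new(None)

main = df.Orchestrator.create(orchestrator_function)
```

continue_as_new restarts the orchestration with a clean history, which is what keeps the eternal loop safe to run indefinitely.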
Good luck!
Durable Functions is your go-to option here. What you can do is:
1) Have a controller function, called the orchestrator function
2) Have a child function which can be invoked multiple times by the orchestrator
In your orchestrator function, you wait until all the child instances give you a response back, which gives you a fan-out/fan-in scenario.
Have a look at the following links:
https://blog.mexia.com.au/tag/azure-durable-functions and https://learn.microsoft.com/en-us/azure/azure-functions/durable-functions-overview
I am trying to utilise the For Each object in Logic Apps and I am not sure if this is possible:
Step 1 - SQL Stored Procedure is executed returning a number of rows
Step 2 - For Each based on the number of rows in Step 1
Step 2a - Execute a SQL Procedure based on values in the loop
The problem I am finding is that when I add the loop, it states that there is no dynamic content available even though Step 1 will return results.
I've googled and there seems to be no information on using the for each loop in Logic Apps.
Any suggestions?
Thanks
We're working on a fix for this. Before the fix is available, try one of these two workarounds:
If you're comfortable with code view, create a for-each loop in the designer, then specify ResultSets as the input for the foreach.
Alternatively, add a Compose card after the SQL action and choose ResultSets as its input. Then use the output of the Compose card as the input for the for-each loop.
Unfortunately, it's by design:
https://learn.microsoft.com/en-us/azure/automation/automation-runbook-types#powershell-runbooks
I resolved it by calling the SPs in parallel via Logic Apps.
(Note: when you do so, don't forget to hit "Publish" on the runbook in order for the input variables passed in to be picked up correctly.)
I want to create a job in Spring Batch which should consist of two steps:
Step 1 - The first step reads certain transactions from the database and produces a list of record ids that will be sent to step 2 via a jobContext attribute.
Step 2 - This should be a partition step: The slave steps should be partitioned based on the list obtained from step 1 (each thread gets a different Id from the list) and perform their read/process/write operations without interfering with each other.
My problem is that even though I want to partition the data based on the list produced by step 1, Spring configures step 2 (and thus calls the partitioner's partition() method) before step 1 even starts, so I cannot inject the partitioning criteria in time. I tried using @StepScope on the partitioner bean, but it still attempts to create the partitions before the job starts.
Is there a way to dynamically create the step partitions during runtime, or an alternative way to divide a step into threads based on the list provided by step 1?
Some background:
I am working on a batch job using Spring Batch which has to process Transactions stored in a database. Every transaction is tied to an Account (in a different table), which has an accountBalance that also needs to be updated whenever the transaction is processed.
Since I want to perform these operations using multi-threading, I thought a good way to avoid collisions would be to group transactions based on their accountId, and have each thread process only the transactions that belong to that specific accountId. This way, no two threads will attempt to modify the same Account at the same time, as their Transactions will always belong to different Accounts.
However, I cannot know which accountIds need to be processed until I get the list of transactions to process and extract the list from there, so I need to be able to provide the list to partition at runtime. That's why I thought I could generate that list in a previous step, and then have the next step partition and process the data accordingly.
Is the approach I am taking plausible with this setup? Or should I just look for a different solution?
I couldn't find a way to partition the data mid-job like I wanted, so I had to use this workaround:
Instead of dividing the job into two steps, I moved the logic from step 1 (the "setup step") into a service method that returns the list of transactions to process, and added a call to that method inside the partition() method in my partitioner, allowing me to create the partitions based on the returned list.
This achieves the same result in my case, although I'm still interested in knowing if it is possible to configure the partitions mid-job, since this solution would not work if I had to perform more complex processing or writing in the setup step and wanted to configure exception handling policies and such. It probably would not work either if the setup step was placed in the middle of a step chain instead of at the start.