I have a Lambda function written in Node.js that executes every 15 minutes. I need to compare the records processed in one execution (a list of strings identifying all the records) with the records in the next execution, and avoid processing the same records again, based on string comparison. So the first execution will store the info in a list of strings, and in the second execution I would first compare the records about to be processed with each string in the collection from the first execution. Once the second execution has finished processing the fresh records, I will replace the string list with the new records, for the same comparison in the third execution.
I figured that we should not be using any global variable as they tend to get changed in the execution.
So is there a way to achieve it?
No, you can't save variables like that in Lambda. You can, however, save the list in a text file on S3. Read this file during the next execution and make the necessary edits for the execution after that.
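A minimal sketch of that read-compare-replace cycle, written in Python for brevity (the same calls exist in the Node.js SDK). The function names, bucket and key are hypothetical, and `s3` is assumed to be a boto3 S3 client, e.g. `boto3.client("s3")`. One assumption to note: the sketch stores everything seen in the current run rather than only the fresh records, so that a record appearing in three consecutive runs is not reprocessed in the third.

```python
import json


def load_previous(s3, bucket, key):
    """Return the record strings saved by the previous run (empty on the first run)."""
    try:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    except s3.exceptions.NoSuchKey:
        # No state file yet: this is the first execution.
        return set()
    return set(json.loads(body))


def run_once(s3, bucket, key, candidates, process):
    """Process only records not seen in the previous run, then replace the stored list."""
    previous = load_previous(s3, bucket, key)
    fresh = [record for record in candidates if record not in previous]
    for record in fresh:
        process(record)
    # Assumption: store all records seen in this run, not just the fresh ones.
    body = json.dumps(list(candidates)).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return fresh
```

In a handler this would be called once per invocation, for example `run_once(s3, "my-bucket", "state/processed.json", records, handle_record)`, where all four arguments after `s3` are placeholders.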
I am trying to retrieve a specific step function execution input in the past using the list_executions and describe_execution functions in boto3, first to retrieve all the calls and then to get the execution input (I can't use describe_execution directly as I do not know the full state machine ARN). However, list_executions does not accept a filter argument (such as "name"), so there is no way to return partial matches, but rather it returns all (successful) executions.
The solution for now has been to list all the executions and then loop over the list and select the right one. The issue is that this function can return at most the 1000 newest records (as per the documentation), which will soon be a problem, as there will be more than 1000 executions and I will need to get old executions.
Is there a way to specify a filter in the list_executions/describe_execution functions to retrieve executions by a partial match, for example by prefix?
import boto3

sf = boto3.client("stepfunctions").list_executions(
    stateMachineArn="arn:aws:states:something-something",
    statusFilter="SUCCEEDED",
    maxResults=1000,
)
You are right that the SFN APIs like ListExecutions do not expose other filtering options. Nonetheless, here are two ideas to make your task of searching execution inputs easier:
Use the ListExecutions Paginator to help with looping through the response items.
If you know in advance which inputs are of interest, add a Task to the State Machine to persist execution inputs and ARNs to, say, a DynamoDB table, in a manner that makes subsequent searches easier.
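As a rough sketch of the first idea, the paginator can walk every page of results while the prefix filter is applied client-side. The function name and the `name_prefix` parameter are hypothetical; `client` is assumed to be `boto3.client("stepfunctions")`.

```python
def find_execution_inputs(client, state_machine_arn, name_prefix, status="SUCCEEDED"):
    """Return name, ARN and input for executions whose name starts with name_prefix."""
    paginator = client.get_paginator("list_executions")
    pages = paginator.paginate(stateMachineArn=state_machine_arn, statusFilter=status)

    matches = []
    for page in pages:
        for execution in page["executions"]:
            # ListExecutions has no name filter, so filter on the client side.
            if not execution["name"].startswith(name_prefix):
                continue
            detail = client.describe_execution(executionArn=execution["executionArn"])
            matches.append(
                {
                    "name": execution["name"],
                    "executionArn": execution["executionArn"],
                    "input": detail.get("input"),
                }
            )
    return matches
```

The paginator follows `nextToken` internally, so the loop is not limited to a single page of 1000 results. It still issues one describe_execution call per match, which is why persisting inputs to a table may be preferable for frequent lookups.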
If I schedule a timer triggered Azure function to run every second and my function is taking 2 seconds to execute, will I just get back-to-back executions or will some execution queue eventually overflow?
Background:
We have a timer-triggered Azure function that currently executes every 30 seconds and checks for new rows in a database table. If there are new rows, the data will be processed and the rows will be marked as handled.
If there are no new rows the execution is very fast. If there are 500 new rows (which is the max we are fetching at the moment) the execution takes about 20-25 seconds.
We would like to decrease the interval to one second to reduce the latency of row processing.
Update: I want back-to-back executions and I want to avoid overlapping executions.
Multiple Azure functions can run concurrently. This means you can still trigger the function again while the previously triggered function is still running, and both will run concurrently. They will only queue up if you set up options to run only 1 function at a time on 1 instance, but it doesn't look like you want that.
With concurrency, 2 functions will read the same table in the DB at the same time, so you should read your table with the UPDLOCK option. This will prevent the subsequently triggered function from reading the same rows that were read by the previous function.
In short, the answer to your question is neither. If your functions overlap, by default, you will get multiple functions running at the same time.
To achieve back-to-back execution for timer triggers, set WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT and FUNCTIONS_WORKER_PROCESS_COUNT to 1 in the application settings. This will ensure only 1 function execution runs at a time.
I have a Node.js function that needs to be executed for each order in my application. In this function my app gets an order number from an Oracle database, processes the order and then adds 1 to that number in the database (this needs to be the last thing in the function, because an order can fail, in which case the number will not be used).
If all orders received at time T are processed at the same time (asynchronously), then the same order number will be used for multiple orders, and I don't want that.
So I used RabbitMQ to try to remedy this, since it is a queue. It seems that the processes finish in the order they should, but a second process does NOT wait for the first one to finish (ack) before it begins, so in the end I have the same problem of using the same order number multiple times.
Is there any way I can configure my queue to process one message at a time? To only start process n+1 when process n has been acknowledged?
This would be a life saver to me!
If the problem is to avoid duplicate order numbers, then use an Oracle sequence, or use an identity column when you insert into a table to generate the order number:
CREATE TABLE mytab (
id NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY(START WITH 1),
data VARCHAR2(20));
INSERT INTO mytab (data) VALUES ('abc');
INSERT INTO mytab (data) VALUES ('def');
SELECT * FROM mytab;
This will give:
        ID DATA
---------- --------------------
         1 abc
         2 def
If the problem is that you want orders to be processed sequentially, then don't pull an order from the queue until the previous one is finished. This will limit your throughput, so you need to understand your requirements and make some architectural decisions.
Overall, it sounds like Oracle Advanced Queuing would be a good fit. See the node-oracledb documentation on AQ.
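To illustrate the sequential option, here is a minimal Python sketch of a single consumer that only pulls the next order once the previous one has finished. It is not RabbitMQ or AQ code: an in-memory `queue.Queue` stands in for the real queue, a local counter stands in for the number kept in the database, and `process_order` is a hypothetical handler.

```python
import queue


def process_sequentially(orders, process_order, first_number=1):
    """Handle one order at a time; a number is only consumed when an order succeeds."""
    next_number = first_number
    results = []
    while True:
        try:
            # Pull the next order only after the previous one has finished.
            order = orders.get_nowait()
        except queue.Empty:
            break
        try:
            process_order(order, next_number)
        except Exception:
            # Failed order: the number stays available for the next order.
            results.append((order, None))
        else:
            results.append((order, next_number))
            next_number += 1
    return results
```

Because there is only one consumer and no overlap between orders, two orders can never see the same number, at the cost of throughput.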
I have to make 100000 sequential HTTP requests with Spark and store the responses in S3. I say sequential because each request returns around 50KB of data, and I have to wait 1 second between requests so as not to exceed API rate limits.
Where to make HTTP calls: from Spark Job's code (executed on driver/master node) or from dataset transformation (executed on worker node)?
Workarounds
Make HTTP requests from my Spark job (on the driver/master node), create a dataset from each HTTP response (each contains 5000 JSON items) and save each dataset to S3 with the help of Spark. There is no need to keep a dataset after it has been saved.
Create dataset from all 100000 URLs (move all further computations to workers), make HTTP requests inside map or mapPartition, save single dataset to S3.
The first option
It's simpler and it reflects the nature of my computations: they're sequential, because of the 1-second delay. But:
Is it bad to make 100_000 HTTP calls from Driver/Master node?
Is it more efficient to create/save one 100_000 * 5_000 dataset than to create/save 100_000 small datasets of size 5_000?
Each time I create a dataset from an HTTP response, I'll move the response to a worker and then save it to S3, right? Double shuffling then...
Second option
Actually it won't benefit from parallel processing, since you have to keep an interval of 1 second between requests. The only bonus is moving the computations (even if they aren't too heavy) off the driver. But:
Is it worth moving the computations to workers?
Is it a good idea to make API call inside transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2xGET, 1xLIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files, a POST to initiate multipart upload after that first block, one POST per 32 MB (of 32MB, obviously) and a final POST of a JSON file to complete. So: slightly more efficient
Where small S3 files matter is in the bills from AWS and in follow-up Spark queries: whatever you use (Spark, PySpark, SQL, etc.), many small files are slower. There's a high cost in listing files in S3, and every task pushed out to a Spark worker has some setup/commit/complete costs.
Regarding doing HTTP API calls inside a worker: well, you can do fun things there. If the result isn't replicable, then task failures and retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or other object store in workers; first the RDD of the copy src/dest operations is built up, then they are pushed out to workers. The result of the worker code includes upload duration length info, if someone ever wanted to try and aggregate the stats (though there you'd probably need timestamp for some time series view)
Given you have to serialize the work to one request/second, 100K requests is going to take over a day. If each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally, so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how could you do this operation such that every 15-20 minutes of work is saved, and on a restart you can carry on from there?
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be
* first RDD takes list of queries and some summary info about any existing checkpointed data, calculates the next 15 minutes of work,
* building up a list of GET calls to delegate to 1+ worker. Either 1 URL/row, or have multiple URLs in a single row
* run that job, save the results
* test recovery works with a smaller window and killing things.
* once happy: do the full run
Maybe also: recognise & react to any throttle events coming off the far end by
1. Sleeping in the worker
1. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune sleep window for subsequent tasks.
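The checkpoint-and-resume idea could look roughly like the following single-machine Python sketch. It is not Spark code: local JSON files stand in for S3, `fetch` is a hypothetical function that performs one GET, and the default batch size of 900 corresponds to about 15 minutes of work at one request per second.

```python
import json
import os
import time


def run_with_checkpoints(urls, fetch, checkpoint_path, out_dir,
                         batch_size=900, delay=1.0, sleep=time.sleep):
    """Fetch urls one at a time, saving results and progress after every batch."""
    os.makedirs(out_dir, exist_ok=True)

    # Resume from the last saved position, if a checkpoint exists.
    next_index = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            next_index = json.load(f)["next_index"]

    for batch_start in range(next_index, len(urls), batch_size):
        batch = urls[batch_start:batch_start + batch_size]
        responses = []
        for url in batch:
            responses.append(fetch(url))
            sleep(delay)  # stay under the rate limit

        # Save the batch first, then record that it is done.
        out_path = os.path.join(out_dir, f"batch-{batch_start:06d}.json")
        with open(out_path, "w") as f:
            json.dump(responses, f)
        next_index = batch_start + len(batch)
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": next_index}, f)

    return next_index
```

If the process dies mid-batch, at most one batch of requests is repeated on restart, and the repeated batch overwrites its own output file rather than creating a duplicate.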
I have encountered a use case where I am fetching rows from tMysqlInput and then iterating over them one by one to complete a subjob for each of the rows retrieved.
tMysqlInput-----> iterate -------> (job having multiple components such as writing a file, logging it, entry into the database and different processes like this, which is a complete process in itself).
The problem is that, since the subjob after the iterate link takes care of everything itself, I just want to fork as many subjobs as the number of rows fetched from tMysqlInput, each with different context parameters.
So I tried the following:
tMysqlInput ------>iterate(*n, where n is number of rows fetched)----->(job)
But what happens here is that the threads read each other's context variables, and hence end up writing similar content to similar files, the same DB entries, etc.
I want to parallelize the child job depending on the number of rows fetched, with the threads kept in sync.
Say the tMysqlInput query is: select file_id, input_path , output_path from some table where status='copied';
Say I get 4 tuples; then I want to iterate over the 4 tuples at the same time. Just execute the child job and let the child job run on its own.
thanks
try this -
1) Click on the Iterate link. In the Component Properties tab, under Basic Settings, check the Enable Parallel Execution checkbox; you can then enter the number of iterations you want to run in parallel. This could be the number of rows returned by the tMysqlInput component (however, the total-number-of-rows variable only has a value AFTER tMysqlInput has executed: globalMap.get("tMysqlInput_X_NB_LINE")).
2) You can pass context variable values to sub jobs. First define the context variables in your sub job. Then, on the tSubJob after the iterate link, click on the Component Properties tab; you will see the Context Param table/grid, where you click on the + symbol to select a context variable and assign its value.