Suggested way for ADF to trigger a pipeline on SQL table change - Azure

I have a tracking SQL table which has following schema:
CREATE TABLE [dbo].[TEST_TABLE](
[id] [int] IDENTITY(1,1) NOT NULL,
[value] [nvarchar](50) NULL,
[status] [nvarchar](50) NULL,
[source] [nvarchar](50) NULL,
[timestamp] [datetime] NULL
)
My application code automatically maintains the table by inserting records and updating the status field.
My goal is to trigger an ADF pipeline based on the result of the following query:
SELECT COUNT(1) AS cnt FROM [dbo].[TEST_TABLE] WHERE [status] = 'active'
If the result is >0, then trigger an ADF pipeline.
Current status:
My current work:
1. Set up a stored procedure SP_TEST that returns 1 if the condition is met, otherwise 0 (a sketch follows below).
2. Set up a pipeline like below: the result of the SP is parsed and used for routing to trigger the later stages (which mark the SQL table status 'inactive' to avoid duplicate processing).
3. Associate the pipeline with a schedule trigger that fires every 5 minutes.
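For reference, SP_TEST is roughly like the following (a simplified sketch; the exact body here is illustrative):
-- Simplified sketch of SP_TEST: returns 1 when at least one 'active' row exists, otherwise 0.
CREATE OR ALTER PROCEDURE [dbo].[SP_TEST]
AS
BEGIN
    SET NOCOUNT ON;
    SELECT CASE WHEN EXISTS (
               SELECT 1 FROM [dbo].[TEST_TABLE] WHERE [status] = 'active'
           ) THEN 1 ELSE 0 END AS [result];
END
The later pipeline stages then run an update such as UPDATE [dbo].[TEST_TABLE] SET [status] = 'inactive' WHERE [status] = 'active' so the same records are not processed twice.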
My current setup is "working", in the sense that it can detect every 5 minutes whether there is a DB change and execute the subsequent processing.
Problem:
However, the schedule trigger may be too frequent and consumes activity run units on every execution, which could be costly. Is there any trigger like a "SQL table change trigger"?
What I have tried:
A quick Google search points me to this link, but there seems to be no answer yet.
I am also aware of the storage event trigger and the custom events trigger. Unfortunately, we are not permitted to create other Azure resources; only the existing ADF and SQL Server are provided to us.
I appreciate any insights/directions in advance.

Polling using ADF can be expensive, so we want to avoid that. Instead, have the polling take place within an Azure Logic App; it is much cheaper. Here are the steps to listen to a SQL Server DB (Azure included) and then trigger an ADF pipeline if a table change is found.
Here is the pricing for Azure Logic App:
I believe this means that every trigger uses a standard connector, so it will be 12.5 cents (USD) per 1,000 firings of the app, and 2.5 cents (USD) per 1,000 actions triggered.
For ADF it is $1 (USD) per 1,000 activity runs, so ADF is much more expensive for this kind of polling.
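One caveat, assuming you use the SQL connector's "When an item is modified" trigger (this is an assumption about your setup, so please verify): that trigger relies on a ROWVERSION column to detect changes, and the [timestamp] column in your table is a plain datetime. If that is the trigger you end up using, the table may need something like:
-- Hypothetical: add a rowversion column so the SQL connector's
-- "When an item is modified" trigger can detect changes to TEST_TABLE.
ALTER TABLE [dbo].[TEST_TABLE] ADD [RowVer] rowversion;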
Please let me know if you have any issues at all!

Related

Azure Data Factory ForEach activity pricing

I'm having a hard time understanding how ADF will be charged in the following scenario:
1 pipeline, 2 activities, with 1 being a ForEach which will loop 1000+ times. The activity inside the ForEach is a stored procedure.
So will this be 2 activity runs, or more than 1000?
You can see the result in the ADF monitor - but it will be 1002 activity runs at $1/1000 (or whatever the current rate is).
It's much cheaper (if you mind the dollar) if you can pass the list into your proc; the lookup.output field is just JSON with your table list in it, which you could parse in the proc (see the sketch below).
https://learn.microsoft.com/en-us/azure/data-factory/pricing-concepts
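As a rough illustration of the pass-the-list-in approach (the procedure name, parameter name, the TableName property and the proc body are assumptions, not taken from the question): hand the Lookup output to the procedure as a single NVARCHAR(MAX) argument and unpack it with OPENJSON, so the pipeline only pays for the Lookup plus one Stored Procedure activity.
-- Hypothetical sketch: receive the Lookup activity's output as JSON and
-- iterate over the table names inside the proc instead of in a ForEach.
CREATE OR ALTER PROCEDURE [dbo].[usp_ProcessTableList]
    @tableListJson NVARCHAR(MAX)   -- e.g. @{activity('Lookup1').output} passed from ADF
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @tableName NVARCHAR(128);

    -- Lookup output is typically shaped like {"count":N,"value":[{"TableName":"..."}, ...]}
    DECLARE table_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT [TableName]
        FROM OPENJSON(@tableListJson, '$.value')
             WITH ([TableName] NVARCHAR(128) '$.TableName');

    OPEN table_cursor;
    FETCH NEXT FROM table_cursor INTO @tableName;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- ... the per-table work that the old per-iteration proc did goes here ...
        FETCH NEXT FROM table_cursor INTO @tableName;
    END

    CLOSE table_cursor;
    DEALLOCATE table_cursor;
END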

How to copy managed database?

AFAIK there is no REST API providing this functionality directly, so I am using a restore for this (there are other ways, but those don't guarantee transactional consistency and are more complicated) via the Create request.
Since it is not possible to turn off short-term backup (retention has to be at least 1 day), it should be reliable. I am using the current time for the 'properties.restorePointInTime' property in the request. This works fine for most databases, but one DB returns this error (from the async operation request):
"error": {
"code": "BackupSetNotFound",
"message": "No backups were found to restore the database to the point in time 6/14/2021 8:20:00 PM (UTC). Please contact support to restore the database."
}
I know I am not out of range, because if the restore time is before 'earliestRestorePoint' (this can be found in a GET request on the managed database) or in the future, I get a 'PitrPointInTimeInvalid' error. Nevertheless, I found some information that I shouldn't use the current time but rather the current time minus 6 minutes at most. This is also true when done via the Azure Portal (where it fails with the same error, btw), which doesn't allow entering a time newer than the current time minus 6 minutes. After a few tries, I found out that the current time minus roughly 40 minutes starts to work fine. But 40 minutes is a lot, and I didn't find any way to determine which time works before I try it and wait for the result of the async operation.
My question is: Is there a way to find what is the latest time possible for restore?
Or is there a better way to do ‘copy’ of managed database which guarantees transactional consistency and is reasonably quick?
EDIT:
The issue I was describing was reported to MS. It was occurring when:
there is a custom time zone format, e.g. UTC + 1 hour;
backups are skipped for the source database at the desired point in time because the database is inactive (no active transactions).
This should be fixed as of now (25th of August 2021), and I was not able to reproduce it with the current time minus 10 minutes. Also, I was told there should be a new API which would allow making a copy without using PITR (no sooner than Q1 2022).
To answer your first question "Is there a way to find what is the latest time possible for restore?"
Yes. Via SQL. The only way to find this out is by using extended event (XEvent) sessions to monitor backup activity.
The process to start logging the backup_restore_progress_trace extended event and report on it is described here: https://learn.microsoft.com/en-us/azure/azure-sql/managed-instance/backup-activity-monitor
Including the SQL here in case the link goes stale.
This is for storing in the ring buffer (max last 1000 records):
CREATE EVENT SESSION [Verbose backup trace] ON SERVER
ADD EVENT sqlserver.backup_restore_progress_trace(
WHERE (
[operation_type]=(0) AND (
[trace_message] like '%100 percent%' OR
[trace_message] like '%BACKUP DATABASE%' OR [trace_message] like '%BACKUP LOG%'))
)
ADD TARGET package0.ring_buffer
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,
MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,
TRACK_CAUSALITY=OFF,STARTUP_STATE=ON)
ALTER EVENT SESSION [Verbose backup trace] ON SERVER
STATE = start;
Then to see output of all backup events:
WITH
a AS (SELECT xed = CAST(xet.target_data AS xml)
      FROM sys.dm_xe_session_targets AS xet
      JOIN sys.dm_xe_sessions AS xe
        ON (xe.address = xet.event_session_address)
      WHERE xe.name = 'Verbose backup trace'),
b AS (SELECT
      d.n.value('(@timestamp)[1]', 'datetime2') AS [timestamp],
      ISNULL(db.name, d.n.value('(data[@name="database_name"]/value)[1]', 'varchar(200)')) AS database_name,
      d.n.value('(data[@name="trace_message"]/value)[1]', 'varchar(4000)') AS trace_message
      FROM a
      CROSS APPLY xed.nodes('/RingBufferTarget/event') d(n)
      LEFT JOIN master.sys.databases db
        ON db.physical_database_name = d.n.value('(data[@name="database_name"]/value)[1]', 'varchar(200)'))
SELECT * FROM b
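Building on the query above, the most recent completed backup per database should roughly correspond to the newest usable restore point (assuming, as in the session filter, that completion shows up as a '100 percent' trace message):
-- Latest completed backup per database, read from the same ring buffer.
WITH
a AS (SELECT xed = CAST(xet.target_data AS xml)
      FROM sys.dm_xe_session_targets AS xet
      JOIN sys.dm_xe_sessions AS xe
        ON (xe.address = xet.event_session_address)
      WHERE xe.name = 'Verbose backup trace'),
b AS (SELECT
      d.n.value('(@timestamp)[1]', 'datetime2') AS [timestamp],
      ISNULL(db.name, d.n.value('(data[@name="database_name"]/value)[1]', 'varchar(200)')) AS database_name,
      d.n.value('(data[@name="trace_message"]/value)[1]', 'varchar(4000)') AS trace_message
      FROM a
      CROSS APPLY xed.nodes('/RingBufferTarget/event') d(n)
      LEFT JOIN master.sys.databases db
        ON db.physical_database_name = d.n.value('(data[@name="database_name"]/value)[1]', 'varchar(200)'))
SELECT database_name, MAX([timestamp]) AS last_completed_backup
FROM b
WHERE trace_message LIKE '%100 percent%'
GROUP BY database_name;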
NOTE: This tip came to me via Microsoft support when I had the same issue of point-in-time restores failing at what seemed like random times. They do not give any SLA for log backups. I found that on a busy database the log backups seemed to happen every 5-10 minutes, but on a quiet database only hourly. Recovery of a database this way can be slow depending on the number of transaction logs and the amount of activity to replay, etc. (https://learn.microsoft.com/en-us/azure/azure-sql/database/recovery-using-backups)
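Once you have the timings you need, the session can be stopped and dropped again (standard XEvent housekeeping):
-- Optional clean-up after the investigation.
ALTER EVENT SESSION [Verbose backup trace] ON SERVER STATE = stop;
DROP EVENT SESSION [Verbose backup trace] ON SERVER;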
To answer your second question: "Or is there a better way to do ‘copy’ of managed database which guarantees transactional consistency and is reasonably quick?"
I'd have to agree with Thomas - if you're after guaranteed transactional consistency and speed you need to look at creating a failover group https://learn.microsoft.com/en-us/azure/azure-sql/database/auto-failover-group-overview?tabs=azure-powershell#best-practices-for-sql-managed-instance and https://learn.microsoft.com/en-us/azure/azure-sql/managed-instance/failover-group-add-instance-tutorial?tabs=azure-portal
A failover group for a managed instance will have a primary server and a failover server with the same user databases on each, kept in sync.
But yes, whether this suits your needs depends on the question Thomas asked: what is the purpose of the copy?

OData query for all rows from the last 10 minutes

I need to filter rows from an Azure Table Store that are less than 10 minutes old. I'm using an Azure Function App integration to query the table, so a coded solution is not viable in this case.
I'm aware of the datetime type, but for this I have to specify an explicit datetime, for example:
Timestamp gt datetime'2018-07-10T12:00:00.1234567Z'
However, this is insufficient, as I need the query to run on a timer every 10 minutes.
According to the OData docs, there are built-in functions such as totaloffsetminutes() and now(), but using these causes the function to fail.
[Error] Exception while executing function: Functions.FailedEventsCount. Microsoft.WindowsAzure.Storage: The remote server returned an error: (400) Bad Request.
Is there a way to query a Table Store dynamically in this way?
Turns out that this was easier than expected.
I added the following query filter to the Azure Table Store input integration -
Timestamp gt datetime'{afterDateTime}'
In conjunction with a parameter in the Function trigger route, and Bob's your uncle -
FailedEventsCount/after/{afterDateTime}
I appreciate that for other use cases it may not be viable to pass in the datetime, but for me that is perfectly acceptable.

Execute stored procedure within Azure Logic App fails with Gateway Timeout

I've been trying to develop an Azure Logic App that imports files from an FTP server and parses the contents with a stored procedure in an Azure SQL database.
Currently I've been struggling with executing this stored procedure from the Logic App; the stored procedure can take up to 10 minutes to execute.
I've tried a few solutions when setting up the Execute Stored Procedure action in the Azure Logic App:
- Add execute stored procedure as an action with an asynchronous timeout of one hour (PT1H)
- Surround it with a do-until loop that checks the return code.
None of these solutions seem to resolve the issue. Does anyone have anything else I can try when developing this Azure Logic App?
If you can reduce the runtime of the stored procedure by reducing the data payload in the joined tables, you can use pagination to achieve successful execution via the Logic App.
For example, let's say you have a stored procedure like sp_UpdateAColumn which updates columns on tableA based on JOINs with tableB, tableC and tableD.
This runs, but takes more than 2 minutes to finish because of the large number of rows in tableA.
You can reduce the time of this SP by, say, creating a new column isUpdated on tableA which is a bit/boolean flag with a default value of 0.
Then if you use
SELECT TOP 100 * FROM tableA WHERE isUpdated = 0
instead of the whole of tableA in the JOIN, you should be able to update those 100 rows in under two minutes.
So if you change the definition of the SP from sp_UpdateAColumn to
sp_UpdateAColumnSomeRows (@pageSize INT), then in the JOINs where you use tableA, use
(SELECT TOP (@pageSize) * FROM tableA WHERE isUpdated = 0) instead.
Now you need to ensure that this new SP is called enough times to process all records; for this, use a do-until loop in the Logic App (for total rows in tableA / pageSize iterations) and call your SP inside this loop.
Try tweaking the pageSize parameter to find the optimal page size; a sketch of the paged procedure follows below.
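To make the shape of that concrete, here is a minimal sketch under the assumptions above (tableA, tableB, the isUpdated flag and the column names are all illustrative, not your real schema):
-- Hypothetical paged version of the procedure: processes at most @pageSize
-- unprocessed rows per call, so each call stays well under the connector timeout.
CREATE OR ALTER PROCEDURE [dbo].[sp_UpdateAColumnSomeRows]
    @pageSize INT = 100
AS
BEGIN
    SET NOCOUNT ON;

    ;WITH pageA AS (
        SELECT TOP (@pageSize) *
        FROM tableA
        WHERE isUpdated = 0
        ORDER BY id                    -- assumes a key column so the page is deterministic
    )
    UPDATE a
    SET a.someColumn = b.someValue,    -- the real JOIN/UPDATE logic from sp_UpdateAColumn goes here
        a.isUpdated  = 1               -- mark the page as done so the next call picks fresh rows
    FROM pageA AS a
    JOIN tableB AS b ON b.aId = a.id;

    -- Tell the caller how many rows are still pending so the do-until
    -- loop knows when to stop.
    SELECT COUNT(*) AS remainingRows FROM tableA WHERE isUpdated = 0;
END
The do-until loop in the Logic App can then either run a fixed number of iterations (total rows / pageSize) or stop when remainingRows reaches 0.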

Workflow System with Azure Table Storage

I have a system where we need to run a simple workflow.
Example:
On Jan 1st 08:15 trigger task A for object Z
When triggered then run some code (implementation details not important)
Schedule task B for object Z to run at Jan 3rd 10:25 (and so on)
The workflow itself is simple, but I need to run 500,000+ instances, and that's the tricky part.
I know Windows Workflow Foundation, and for that very same reason I have chosen not to use it.
My initial design would be to use Azure Table Storage and I would really appreciate some feedback on the design.
The system will consist of two tables
Table "Jobs"
PartitionKey: ObjectId
Rowkey: ProcessOn (UTC Ticks in reverse so that newest are on top)
Attributes: State (Pending, Processed, Error, Skipped), etc...
Table "Timetable"
PartitionKey: YYYYMMDD
Rowkey: YYYYMMDDHHMM_<GUID>
Attributes: Job_PartitionKey, Job_RowKey
The idea is that the Jobs table will have the complete history of jobs per object and the Timetable will have a list of all jobs to run in the future.
Some assumptions:
A job will never span more than one Object
There will only ever be one pending job per Object
The "job" is very lightweight e.g. posting a message to a queue
The system must be able to perform these tasks:
Execute pending jobs
Query for all records in "Timetable" with a "partition <= Today" and "RowKey <= today"
For each record (in parallel)
Lookup job in Jobs table via PartitionKey and RowKey
If "not exists" or State != Pending then skip
Execute "logic". If fails => log and maybe do some retry logic
Submit "Next run date in Timetable"
Submit "Update State = Processed" and "New Job Record (next run)" as a single transaction
When all are finished => Delete all processed Timetable records
Concern: Only two of the three record modifications are in a transaction. Could this be overcome in any way?
Stop workflow
Stop/pause workflow for Object Z
Query top 1 jobs in Jobs table by PartitionKey
If any AND State == Pending then update to "Cancelled"
(No need to bother cleaning Timetable it will clean itself up "when time comes")
Start workflow
Create Pending record in Jobs table
Create record in Timetable
In terms of "executing the thing" I would
be using a Azure Function or Scheduler-thing to execute the pending jobs every 5 minutes or so.
Any comments or suggestions would be highly appreciated.
Thanks!
How about using Service Bus instead? The BrokeredMessage class has a property called ScheduledEnqueueTimeUtc. You can just schedule when you want your jobs to run via the ScheduledEnqueueTimeUtc property, and then fuggedabouddit. You can then have a triggered webjob that monitors the Service Bus messaging queue, and will be triggered very near when the job message is enqueued. I'm a big fan of relying on existing services to minimize the coding needed.
