I'm having a hard time understanding how ADF will be charged in the following scenario:
1 pipeline with 2 activities, one of which is a ForEach that loops 1,000+ times. The activity inside the ForEach is a stored procedure activity.
So will this be billed as 2 activity runs, or as more than 1,000?
You can see the result in the ADF monitor, but it will be 1,002 activity runs at $1 per 1,000 (or whatever the current rate is), i.e. roughly $1 per pipeline run.
It's much cheaper (if you mind the dollar) if you can pass the whole list into your proc in a single call; the Lookup activity's output field is just JSON with your table list in it, which you could parse inside the proc (for example with OPENJSON, assuming a SQL Server target).
https://learn.microsoft.com/en-us/azure/data-factory/pricing-concepts
I have a requirement to move files in ADLS Gen2 from directory 'A' to directory 'B' or 'C' based on two conditions: move to 'C' if the file is not a CSV or its size is 0, otherwise move to 'B'.
I am planning to use Event Grid (triggered as soon as a file lands in location 'A') plus an Azure Function (for the checks and the move to location 'B' or 'C').
If 100 files land per day, this approach will trigger the Azure Function 100 times.
Is there a better way to do this? Can these smarts be built using just one service (such as Event Hubs instead of Event Grid + Function) so that there is less overhead to maintain?
Thanks for your time.
If you want low effort then try Logic Apps.
What you want is to create a Logic App with a blob trigger that fires when new blobs arrive. That takes care of the trigger.
For the action, you can use "Copy blob" if you like. I'm not sure whether a "Move blob" action is supported, but if it isn't and the "Copy blob" action isn't good enough for you, then you can provide a custom JavaScript snippet as an inline code action.
Couple of notes:
If your Azure Functions are called only 100 times a day and they only do a small check and then move the blob, then under the Consumption plan you'll probably pay less than 1 USD per month.
With Azure Functions you'll have a lot more control, but it will take you a lot longer (compared to Logic Apps) to develop and operate.
Can these smarts be built using just one service?
Of course, you can directly use the blob trigger of an Azure Function.
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob-trigger?tabs=csharp
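For illustration, here is a rough sketch of what that single function could look like in C# (in-process model). The container names "landing", "good" and "rejected", the "StorageConnection" app setting, and the function name are all placeholders for your own setup, and since blob storage has no atomic "move", the move is done as a copy followed by a delete:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class RouteLandedFile
{
    [FunctionName("RouteLandedFile")]
    public static async Task Run(
        [BlobTrigger("landing/{name}", Connection = "StorageConnection")] Stream contents,
        string name,
        ILogger log)
    {
        // Route to "rejected" if the file is not a .csv or is empty, otherwise to "good".
        bool isCsv = name.EndsWith(".csv", StringComparison.OrdinalIgnoreCase);
        string targetContainer = (isCsv && contents.Length > 0) ? "good" : "rejected";

        var service = new BlobServiceClient(
            Environment.GetEnvironmentVariable("StorageConnection"));
        var source = service.GetBlobContainerClient("landing").GetBlobClient(name);
        var target = service.GetBlobContainerClient(targetContainer).GetBlobClient(name);

        // Blob storage has no atomic "move": copy first, then delete the original.
        var copy = await target.StartCopyFromUriAsync(source.Uri);
        await copy.WaitForCompletionAsync();
        await source.DeleteIfExistsAsync();

        log.LogInformation($"Routed {name} to '{targetContainer}'");
    }
}
```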
If 100 files land per day, this approach will trigger the Azure Function 100 times.
Alternatively, you can use an Azure Function with a timer trigger to do a daily check, instead of using Event Grid to trigger the function.
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-timer?tabs=csharp
Just put the logic in the body of the function.
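A minimal sketch of that timer-based variant is below; the CRON schedule, container name and connection setting are again just assumptions, and it reuses the same csv/size routing idea as the blob-trigger sketch above:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class DailyLandingSweep
{
    [FunctionName("DailyLandingSweep")]
    public static async Task Run(
        [TimerTrigger("0 0 6 * * *")] TimerInfo timer, // every day at 06:00 UTC
        ILogger log)
    {
        var container = new BlobContainerClient(
            Environment.GetEnvironmentVariable("StorageConnection"), "landing");

        // Sweep everything currently in the landing container and apply the same
        // csv / size routing logic as in the blob-trigger version above.
        await foreach (var blob in container.GetBlobsAsync())
        {
            log.LogInformation(
                $"Would route {blob.Name} ({blob.Properties.ContentLength ?? 0} bytes)");
        }
    }
}
```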
I'm thinking of using Data Factory to copy data from a Blob Storage container to an SQL table, but I'm not quite sure I understand how the pricing works, specifically how the activities are counted.
So if I have a pipeline with 3 activities that copies the data from a CSV with 1,000 lines, will the total activity count be 3*1 or 3*1000? In other words, will I be charged based on the number of files it processes or the total number of lines it copies?
That's 3 activity runs. Activity runs are measured by the thousand, at $1 per 1,000. Since these are Copy activities, they also consume Data Integration Units (DIUs) at $0.25 per DIU-hour. Pipeline execution time is billed at $0.005 per hour. If you add all this up for 1 pipeline with 3 Copy activities that runs for 1 hour, your total bill is something like 27 cents.
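As a rough back-of-the-envelope breakdown (assuming those three copies together consume about one DIU-hour):
3 activity runs × $1 / 1,000 ≈ $0.003
~1 DIU-hour × $0.25 ≈ $0.25
1 hour of pipeline execution × $0.005 ≈ $0.005
Total ≈ $0.26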
We run THOUSANDS of pipelines a month, all with many activities including quite a few Copy activities. Our Data Factory billing is still so low that it looks like a rounding error in our total Azure spend.
The exception to this is Data Flow. Data Flow is a Spark wrapper, so you have to pay for Cluster time, which can get expensive quickly if you aren't careful.
Actually, you pay for two important meters: orchestration and execution. Please refer to this document for more details.
1. Orchestration: $1 per 1,000 activity runs. You have 3 activities, so that part is $3/1,000 (i.e. $0.003).
2. Execution: this depends on the DIUs you configure, which determine the performance of the data movement.
I've got an Azure Data Factory V2 pipeline with multiple Copy Data activities that run in parallel.
I have a Pause DW webhook activity that pauses an Azure data warehouse after each run. This activity is set to run after the completion of one of the longest-running activities in the pipeline. The pipeline is set to trigger nightly.
Unfortunately, the time taken to run the Copy Data activities varies because it depends on the transactions processed in the business, which vary each day. This means I can't predict which of the parallel activities will finish last, so the whole pipeline often fails because the DW has been paused before some of the activities have even started.
What's the best way of running an activity only after all other activities in the pipeline have completed?
I have tried to add an If activity to the pipeline like this:
However, I then run into this error during validation:
If Condition1
The output of activity 'Copy small tables' can't be referenced since it has no output.
Does anyone have any idea how I can move this forwards?
thanks
Just make the Pause DWH activity depend on all of your parallel activities, i.e. connect each of them (on success) to it. It will then only execute after all of those activities have completed.
I think you can use the Execute Pipeline activity.
Point the trigger at a new pipeline that contains an Execute Pipeline activity calling your current pipeline with the copy activities, and make sure to select Advanced -> Wait for completion. Once the Execute Pipeline activity is done, the flow can move on to the webhook activity that has the logic to pause the DW.
Let me know how this goes.
I have a system where we need to run a simple workflow.
Example:
On Jan 1st at 08:15, trigger task A for object Z
When triggered, run some code (implementation details not important)
Then schedule task B for object Z to run at Jan 3rd 10:25 (and so on)
The workflow itself is simple, but I need to run 500,000+ instances, and that's the tricky part.
I know Windows Workflow Foundation and for that very same reason I have chosen not to use that.
My initial design would be to use Azure Table Storage and I would really appreciate some feedback on the design.
The system will consist of two tables
Table "Jobs"
PartitionKey: ObjectId
Rowkey: ProcessOn (UTC Ticks in reverse so that newest are on top)
Attributes: State (Pending, Processed, Error, Skipped), etc...
Table "Timetable"
PartitionKey: YYYYMMDD
Rowkey: YYYYMMDDHHMM_<GUID>
Attributes: Job_PartitionKey, Job_RowKey
The idea is that the Jobs table will have the complete history of jobs per object, and the Timetable table will have a list of all jobs to run in the future.
Some assumptions:
A job will never span more than one Object
There will only ever be one pending job per Object
The "job" is very lightweight e.g. posting a message to a queue
The system must be able to perform these tasks:
Execute pending jobs
Query for all records in "Timetable" with a "partition <= Today" and "RowKey <= today"
For each record (in parallel)
Lookup job in Jobs table via PartitionKey and RowKey
If "not exists" or State != Pending then skip
Execute "logic". If fails => log and maybe do some retry logic
Submit "Next run date in Timetable"
Submit "Update State = Processed" and "New Job Record (next run)" as a single transaction
When all are finished => Delete all processed Timetable records
Concern: only two of the three record modifications are in a transaction. Could this be overcome in any way? (See the sketch after the task list below.)
Stop workflow
Stop/pause workflow for Object Z
Query top 1 jobs in Jobs table by PartitionKey
If any AND State == Pending then update to "Cancelled"
(No need to bother cleaning Timetable it will clean itself up "when time comes")
Start workflow
Create Pending record in Jobs table
Create record in Timetable
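On the transaction concern above: entity group transactions in Table Storage only cover operations against the same table and the same partition key, so the "State = Processed" update and the new pending job (both in "Jobs", both under the ObjectId partition) can share one batch, while the Timetable insert has to remain a separate write. A rough sketch using the current Azure.Data.Tables SDK, where the entity names, property names and reverse-tick RowKey format are assumptions based on the design above:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;

public static class JobCompletion
{
    // Marks the current job processed and creates the next pending job in one
    // entity group transaction. Both rows live in the "Jobs" table under the
    // same PartitionKey (the ObjectId), which is what makes the batch possible.
    public static async Task CompleteAndScheduleNextAsync(
        TableClient jobsTable, string objectId, string currentRowKey, DateTime nextRunUtc)
    {
        var processed = new TableEntity(objectId, currentRowKey)
        {
            ["State"] = "Processed"
        };

        // Reverse-tick RowKey so the newest job sorts first, as in the design above.
        string nextRowKey = (DateTime.MaxValue.Ticks - nextRunUtc.Ticks).ToString("d19");
        var nextJob = new TableEntity(objectId, nextRowKey)
        {
            ["State"] = "Pending",
            ["ProcessOn"] = nextRunUtc
        };

        var batch = new List<TableTransactionAction>
        {
            new TableTransactionAction(TableTransactionActionType.UpdateMerge, processed, ETag.All),
            new TableTransactionAction(TableTransactionActionType.Add, nextJob)
        };
        await jobsTable.SubmitTransactionAsync(batch);

        // The Timetable row is in a different table, so it cannot join this transaction.
        // Write it separately and let the executor tolerate a Timetable entry whose
        // Jobs record is missing or not Pending (the flow above already skips those).
    }
}
```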
In terms of "executing the thing", I would be using an Azure Function or a scheduler of some kind to execute the pending jobs every 5 minutes or so.
Any comments or suggestions would be highly appreciated.
Thanks!
How about using Service Bus instead? The BrokeredMessage class has a property called ScheduledEnqueueTimeUtc. You can just schedule when you want your jobs to run via the ScheduledEnqueueTimeUtc property, and then fuggedabouddit. You can then have a triggered webjob that monitors the Service Bus messaging queue, and will be triggered very near when the job message is enqueued. I'm a big fan of relying on existing services to minimize the coding needed.
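For example, with the (older) WindowsAzure.ServiceBus SDK that BrokeredMessage comes from, scheduling one job could look roughly like this; the queue name "workflow-jobs", the message body and the property names are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public static class JobScheduler
{
    public static async Task ScheduleJobAsync(
        string connectionString, string objectId, DateTime runAtUtc)
    {
        var client = QueueClient.CreateFromConnectionString(connectionString, "workflow-jobs");

        // Body and properties carry whatever the job runner needs (e.g. "task B for object Z").
        var message = new BrokeredMessage("Run task B")
        {
            // The message stays invisible until runAtUtc, then appears on the queue,
            // at which point the triggered WebJob/Function listening on it picks it up.
            ScheduledEnqueueTimeUtc = runAtUtc
        };
        message.Properties["ObjectId"] = objectId;

        await client.SendAsync(message);
    }
}
```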
I have some trouble understanding slicing (dataset availability) in Azure Data Factory. Let's say I have a source dataset that never changes, and for some reason I set up hourly slicing for it. Will each slice then be identical? What is the point of using slices at all in such a case (i.e. why is availability required)?
Or another case: let's say my source dataset is continuously appended with new data (for example an event log), and each morning I want to do some analysis on the full history of that log. Should I then set up daily slicing? Will each slice include the full history or just the last day?
The slices are the intervals in which the pipeline is executed within the period defined in the start and end properties of the pipeline.
If you have a fixed source and you execute an activity more than once, it will always use the same source (because it does not change). Let's say you set the start time and end time to span one day and set the frequency to 1 hour: the activity will be executed 24 times. You will have 24 slices, all using the same data source.
For your second scenario, where the data keeps changing, you can set the frequency to once a day. What gets processed depends on the activity you define in the pipeline: for example, the pipeline could delete the old source once it finishes processing, or there could be logic in the activity that takes only the new data.