What are the Azure Data Factory Tumbling Window trigger limitations? - azure

I have multiple pipelines that depend on each other, and I want to use a tumbling window trigger. Could you please advise: is this the right approach? Are there any limitations to using a tumbling window trigger?
Regards,
Ashok

A tumbling window trigger is a heavier-weight alternative to the schedule trigger, offering a suite of features for complex scenarios (dependency on other tumbling window triggers, rerunning a failed job, and setting a user retry for pipelines). -- MSFT documentation
Limitations:
A tumbling window trigger has a one-to-one relationship with a pipeline and can only reference a single pipeline (a minimal trigger definition is sketched below).
If one of the dependency triggers fails, you must rerun it successfully in order for the dependent trigger to run.
A tumbling window trigger will wait on its dependencies for seven days before timing out. After seven days, the trigger run will fail.
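For reference, here is a minimal sketch of what such a trigger definition looks like in the trigger's JSON (Code view), built as a Python dict so it can be printed or deployed with your tooling of choice; the trigger, pipeline, and parameter names are hypothetical:

```python
import json

# Hypothetical tumbling window trigger definition (ADF trigger JSON).
# Note the singular "pipeline" section: a tumbling window trigger can
# reference exactly one pipeline.
tumbling_window_trigger = {
    "name": "HourlyUsageTrigger",                          # hypothetical trigger name
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",                           # window unit
            "interval": 1,                                 # one window per hour
            "startTime": "2021-01-01T07:00:00Z",
            "maxConcurrency": 1,                           # windows allowed to run in parallel
            "retryPolicy": {"count": 2, "intervalInSeconds": 30}
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "LoadHourlyUsage",        # hypothetical pipeline name
                "type": "PipelineReference"
            },
            "parameters": {
                "windowStart": "@trigger().outputs.windowStartTime",
                "windowEnd": "@trigger().outputs.windowEndTime"
            }
        }
    }
}

print(json.dumps(tumbling_window_trigger, indent=2))
```

Because of the one-to-one relationship, each pipeline in a dependency chain gets its own trigger like this, and the dependencies between them are expressed through the triggers' dependsOn lists rather than inside a single trigger.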

Related

Azure Data Factory -> Tumbling Window

I have a schedule trigger that executes a pipeline once every day at 6:15 AM. Now my requirement is that I have multiple pipelines which I want to run after the pipeline attached to that schedule trigger completes. How can I achieve this using a tumbling window? I can't find a way.
Thanks,
Vivek
You can add a dependency on the scheduling trigger in the Advanced section of the tumbling window trigger settings.
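For reference, a trigger dependency configured this way shows up in the dependent trigger's dependsOn list. A minimal sketch of that fragment, assuming the upstream trigger is a tumbling window trigger and using hypothetical names:

```python
import json

# Hypothetical "dependsOn" fragment inside the dependent tumbling window
# trigger's typeProperties, referencing an upstream trigger.
depends_on = [
    {
        "type": "TumblingWindowTriggerDependencyReference",
        "referenceTrigger": {
            "referenceName": "Daily615Trigger",    # hypothetical upstream trigger name
            "type": "TriggerReference"
        },
        "offset": "00:00:00",                      # optional shift relative to the upstream window
        "size": "01:00:00"                         # optional; defaults to the child trigger's window size
    }
]

print(json.dumps(depends_on, indent=2))
```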

How does the ForEach activity declare success in Data Factory?

In the following scenario we have a ForEach activity running in an Azure Data Factory pipeline to copy data from source to destination.
The last CopyActivity took 4:10:33, but the ForEach activity declared Succeeded 36 minutes later, at 4:46:12.
The question is: why does the ForEach activity need these extra 36 minutes?
Is it the case that the ForEach also needs to consolidate results from sub-activities before declaring success or failure?
Official answer from Microsoft: the ForEach activity does wait for all inner activity runs to complete. In theory, there should not be much delay in marking the ForEach run as successful after the last activity run within it succeeds. However, ADF relies on a partner service to execute the runs, and it's possible that the partner service runs into failures and cannot complete the ForEach in time. There is built-in logic to retry and recover, but the observable behavior in ADF activity runs is a delay. It's also possible that the orchestration service fails and the partner service keeps retrying its calls, but usually partner-service delay is the main cause here.
Our assumption: the reported duration is end-to-end for the pipeline activity. It takes into account all factors, such as marshaling your data flow script from ADF to the Spark cluster, cluster acquisition time, job execution, and I/O write time. Because ADF is serverless compute, I think ForEach needs time to wait for all activities to acquire and release computing resources, but this is a guess, as there are few official explanations.
So there will be a delay, which varies according to the internal activities.

Tumbling Window trigger in Azure Data Factory - Self Run

I have got an ADF V2 pipeline that runs hourly between 7 AM and 5 PM only. So far I have been using an "Event" trigger that runs hourly, and it was fine.
But somehow the load started to run for more than one hour.
As a result, the next load would start while the previous one was still running.
I have been trying to use a "Tumbling Window" trigger to create a self-dependency on this pipeline, so that it waits for the previous run to complete before starting, but I could not make it work.
If someone has some experience about how to tackle this problem, any insight would be much appreciated.
The Max Concurrency property can be used in tumbling window triggers: if it is set to 1, another instance of the trigger will not be initiated until the previous one completes.
Refer to the documentation for the details.
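A minimal sketch of what that looks like in the trigger JSON, with maxConcurrency set to 1 and an optional self-dependency so each window also waits for the previous window to succeed (names are hypothetical):

```python
import json

# Hypothetical hourly tumbling window trigger that will not overlap itself:
# maxConcurrency caps parallel windows, and the self-dependency makes each
# window wait for the previous window to succeed before it starts.
self_serializing_trigger = {
    "name": "HourlySelfDependentTrigger",              # hypothetical trigger name
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2021-01-01T07:00:00Z",
            "maxConcurrency": 1,                       # only one window runs at a time
            "dependsOn": [
                {
                    "type": "SelfDependencyTumblingWindowTriggerReference",
                    "offset": "-01:00:00",             # depend on the immediately preceding window
                    "size": "01:00:00"
                }
            ]
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "HourlyLoad",         # hypothetical pipeline name
                "type": "PipelineReference"
            }
        }
    }
}

print(json.dumps(self_serializing_trigger, indent=2))
```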

Make the Azure Batch job schedule not wait on the previous iteration

I have an Azure Batch service set up with a job schedule that runs every minute.
The job manager task creates 3-10 tasks within the same job.
Sometimes one of these tasks within the job may take extremely long to complete, although they are usually very fast.
When one of the tasks takes a long time, the next iteration of the job manager task does not begin; it basically waits until all the tasks from the previous iteration have completed.
Is there a way to ensure that the job schedule keeps creating a version of the job every minute even if all the tasks from its previous iteration have not been completed?
I know one option is to make the job manager task create additional jobs instead of tasks. But preferably, I was hoping there is some configuration at the job schedule level that I can turn on that will allow the schedule to create the next iteration without depending on the completion of the previous job.
This seems more like a design question. AFAIK, no, duplicate active job names are not doable from the Azure Batch perspective. (I will stand corrected if this is somehow doable.)
Although, to think this through further, you can read the various design recommendations on the Azure Batch technical overview page or in posts like:
How to use Azure Batch in an event based design and terminate/cleanup finished jobs or
Add Tasks to a running Azure batch job and manually control termination
I think simplicity will be better, such as handling each iteration with a unique job name or something of that sort, but you will know your scenario best. Hope this helps.
Currently, a Job Schedule can have at most one active Job under it at any given time (link), so the behaviour you're seeing is expected.
We don't have any simple feature you can just "turn on" to achieve concurrent jobs from a single job schedule - but I do have a suggestion:
Instead of using the JobSchedule to run all the processing directly, use it to create "worker" jobs that do the processing.
E.g.
At 10:03 am, your job schedule triggers to create job processing-20191031-1003.
At 10:04 am, your job schedule triggers to create job processing-20191031-1004.
At 10:05 am, your job schedule triggers to create job processing-20191031-1005.
and so on
Because the only thing your job schedule does is create another job, it will finish very quickly, ensuring the next job is created on time.
Since your existing jobs already create a variable number of tasks (you said 3-10 tasks, above), I'm hoping this won't be a very complex change for your code.
Note that you will need to ensure your concurrent worker jobs don't step on each other's toes by trying to do the same work multiple times.
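A rough sketch of that pattern with the azure-batch Python SDK, as the script a job manager task might run on each schedule iteration; the account URL, key, pool ID, job prefix, and command lines are hypothetical placeholders:

```python
from datetime import datetime, timezone

from azure.batch import BatchServiceClient
from azure.batch import models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials

# Hypothetical account details; replace with your own.
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.westeurope.batch.azure.com"
)

# Instead of adding its 3-10 work tasks to the schedule's own job, the job
# manager creates a uniquely named worker job and puts the tasks there, so
# the schedule's job finishes almost immediately and the next iteration can
# start on time.
stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
worker_job_id = f"processing-{stamp}"                  # e.g. processing-20191031-1003

client.job.add(batchmodels.JobAddParameter(
    id=worker_job_id,
    pool_info=batchmodels.PoolInformation(pool_id="my-pool"),   # hypothetical pool
    # Let each worker job terminate itself once its tasks are done, so
    # completed jobs don't pile up.
    on_all_tasks_complete=batchmodels.OnAllTasksComplete.terminate_job,
))

# Add the variable number of work tasks (3-10 in the question) to the worker job.
for i in range(3):
    client.task.add(worker_job_id, batchmodels.TaskAddParameter(
        id=f"task-{i}",
        command_line="/bin/bash -c 'echo doing work'",  # hypothetical workload
    ))
```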

Azure Data Factory pipeline varied execution times

We have about 190 hourly usage files that need to arrive in the data lake within a 24-hour period before we can kick off our pipeline, which starts with an analytics activity. We have had this pipeline run on a schedule at an estimated time by which we expect all files to have arrived, but that doesn't always happen, so we then need to re-run the slices for the missing files.
Is there a more efficient way to handle this, so that instead of keeping the pipeline on a schedule, it is triggered by the event that all files have arrived in the data lake?
TIA for input!
You can add an Event Trigger that fires when a new blob is created (or deleted). We do this in production with a Logic App, but Data Factory V2 appears to support it now as well. The benefit is that you don't have to estimate the proper frequency; you can just execute when necessary.
NOTE: there is a limit to the number of concurrent pipelines you can have executing, so if you dropped all 190 files into blob storage at once, you may run into resource availability issues.
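For reference, a minimal sketch of what a blob event trigger definition looks like in the trigger JSON; the names, path, and storage account scope are hypothetical placeholders, and note that it fires once per created blob rather than once after all files have arrived:

```python
import json

# Hypothetical blob event trigger definition (ADF trigger JSON) that starts a
# pipeline whenever a new blob is created under a given container/folder path.
blob_event_trigger = {
    "name": "UsageFileArrivedTrigger",                     # hypothetical trigger name
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/usage/blobs/hourly/",  # hypothetical container/folder
            "blobPathEndsWith": ".csv",
            "events": ["Microsoft.Storage.BlobCreated"],
            # Resource ID of the storage account to watch (placeholder).
            "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "ProcessUsageFile",   # hypothetical pipeline name
                    "type": "PipelineReference"
                },
                "parameters": {
                    "fileName": "@trigger().outputs.body.fileName",
                    "folderPath": "@trigger().outputs.body.folderPath"
                }
            }
        ]
    }
}

print(json.dumps(blob_event_trigger, indent=2))
```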
