We have about 190 hourly usage files that need to arrive in the data lake within a 24-hour period before we can kick off our pipeline, which starts with an analytics activity. We have been running this pipeline on a schedule based on an estimate of when we expect all files to have arrived, but that doesn't always happen, so we then need to re-run the slices for the missing files.
Is there a more efficient way to handle this, so that the pipeline is not on a schedule but is instead triggered by the event that all files have arrived in the data lake?
TIA for input!
You can add an Event Trigger when a new blob is created (or deleted). We do this in production with a Logic App, but Data Factory V2 appears to support it now as well. The benefit is that you don't have to estimate the proper frequency; you can just execute when necessary.
NOTE: there is a limit to the number of concurrent pipelines you can have executing, so if you dropped all 190 files into blob storage at once, you may run into resource availability issues.
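If you only want the main pipeline to run once all 190 files are present, one option is to have the event-triggered piece (a Logic App, Azure Function, or a small helper step) count the files for the current day and only call the main pipeline when the count reaches 190. A minimal sketch of that check, assuming the azure-storage-blob Python SDK and a hypothetical container/prefix layout like usage/2024-01-15/:

```python
from azure.storage.blob import ContainerClient

EXPECTED_FILES = 190  # hourly usage files expected per day (from the question)

def all_files_arrived(conn_str: str, container: str, day_prefix: str) -> bool:
    """Count blobs under the day's prefix and report whether the full set has landed.

    The container name and prefix layout are assumptions for illustration.
    """
    client = ContainerClient.from_connection_string(conn_str, container)
    count = sum(1 for _ in client.list_blobs(name_starts_with=day_prefix))
    return count >= EXPECTED_FILES

# Example: only kick off the main pipeline when everything is there.
# if all_files_arrived(conn_str, "datalake", "usage/2024-01-15/"):
#     start_main_pipeline()  # hypothetical call, e.g. via the ADF REST API
```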
I am trying to copy a backup file that is provided from an HTTP source. The download URL is only valid for 60 seconds, and the Copy step is timing out before it can complete (timeout set to 1 minute 1 second). It completes on occasion but is very inconsistent. When it completes, the step queues for around 40 seconds; other times it is queued for over a minute and the link has expired by the time it eventually gets to downloading the file. It is one zipped JSON file being downloaded, less than 100 KB.
Both Source and Sink datasets use a Managed VNet IR we have created (it must be used on the Sink due to company policy); when using the AutoResolve IR on the Source, queueing takes even longer.
I've tried every variation of 'Max concurrent connections', 'DIU' and 'Degree of copy parallelism' in the Copy activity I can think of, and none seems to have any effect. It appears to be random whether the queue time is short enough for the download to succeed.
Is there any way to speed up the queue process to try and get more consistent successful downloads?
Both Source and Sink datasets use a Managed VNet IR we have created (it must be used on the Sink due to company policy); when using the AutoResolve IR on the Source, queueing takes even longer.
This is pretty much confusing. If your Source and Sink are in a Managed VNet, your IR should also be in the same Managed VNet for better security and performance.
As per this official document:
By design, Managed VNet IR takes longer queue time than Azure IR as we are not reserving one compute node per service instance, so there is a warm up for each copy activity to start, and it occurs primarily on VNet join rather than Azure IR.
Since there is no reserved node, there is no way to speed up the queue process.
In the following scenario we have a ForEach activity running in an Azure Data Factory pipeline to copy data from source to destination.
The last Copy activity took 4:10:33, but the ForEach activity was declared Succeeded 36 minutes later, at 4:46:12.
The question is: why does the ForEach activity need these extra 36 minutes?
Is it the case that the ForEach also needs to consolidate results from sub-activities before declaring success or failure?
Official answer from Microsoft: the ForEach activity does wait for all inner activity runs to complete. In theory, there should not be much delay in marking the ForEach run as succeeded after the last activity run within it succeeds. However, ADF relies on a partner service to execute the runs, and it is possible that the partner service runs into failures and cannot complete the ForEach in time. There is built-in logic to retry and recover, but the visible behavior in ADF activity runs is a delay. It is also possible that the orchestration service fails and the partner service keeps retrying its calls, but usually partner service delay is the main cause here.
Our assumption: the duration time is end-to-end for the pipeline activity. It takes into account all factors, such as marshalling your data flow script from ADF to the Spark cluster, cluster acquisition time, job execution, and I/O write time. Because ADF is serverless compute, I think the ForEach needs time to wait for all activities to acquire and release computing resources, but this is my guess because there are few official explanations.
So there will be a delay, which varies according to the inner activities.
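If you want to measure that gap yourself, one way is to query the activity runs for the pipeline run and compare the ForEach's end time with the end time of its last inner activity. A minimal sketch, assuming the azure-mgmt-datafactory and azure-identity Python packages, placeholder resource names, and a pipeline that contains only the ForEach and its inner copy activities:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# Placeholder identifiers -- replace with your own.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<factory-name>"
PIPELINE_RUN_ID = "<pipeline-run-id>"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Look at activity runs updated in the last day for this pipeline run.
now = datetime.now(timezone.utc)
runs = client.activity_runs.query_by_pipeline_run(
    RESOURCE_GROUP,
    FACTORY_NAME,
    PIPELINE_RUN_ID,
    RunFilterParameters(last_updated_after=now - timedelta(days=1), last_updated_before=now),
).value

# Compare when the ForEach finished vs. when its last inner activity finished.
foreach_end = max(r.activity_run_end for r in runs
                  if r.activity_type == "ForEach" and r.activity_run_end)
last_child_end = max(r.activity_run_end for r in runs
                     if r.activity_type != "ForEach" and r.activity_run_end)
print("Gap between last inner activity and ForEach completion:", foreach_end - last_child_end)
```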
I have a pipeline with a few copy activities. Some of those activities are in charge of copying large amounts of data from a storage account to the same storage account but in a compressed manner (I'm talking about a few TB of data).
After running the pipeline for a few hours, I noticed that some activities show "Queue" time on the monitoring blade, and I was wondering what the reason for that "Queue" time can be. More importantly, am I being billed for that time as well? Because from what I understand, my ADF is not doing anything during it.
Can someone shed some light? :)
(Posting this as an answer because of the comment chars limit)
After a long discussion with Azure Support and reaching out to someone at the ADF product team I got some answers:
1 - The queue time is not being billed.
2 - Initially, the ADF orchestration system puts the job in a queue, and it accrues "queue time" until the infrastructure picks it up and starts the processing part.
3 - In my case the queue time was increasing after the job started because of a bug in the underlying backend executor (it uses Azure Batch). Apparently the executors were crashing and my job was suffering from "re-pickup" time, thus increasing the queue time. This explained why after some time I started to see that the execution time and the transferred data were decreasing. The ETA for this bugfix is at the end of the month. Additionally the job that I was executing timed out (after 7 days) and after checking the billing I confirmed that I wasn't charged a dime for it.
Based on the chart in the ADF Monitor, you can find the same metrics in the example.
In fact, these are the metrics in the executionDetails property: Queue Time + Transfer Time = Duration Time.
More details on the stages copy activity goes through, and the corresponding steps, duration, used configurations, etc. It's not recommended to parse this section as it may change.
Please refer to Parallel Copy: the copy activity creates parallel tasks to transfer data internally. Activities are in an active state during both queue time and transfer time, and never stop during queue time, so the whole duration time is billed. I think this is an inevitable loss in the data transfer process that is absorbed by ADF internally. You could try adjusting the parallelCopies parameter to see if anything changes.
If you are concerned about the cost, you could submit feedback here to ask for a statement from the Azure team.
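For ad-hoc inspection (keeping in mind the warning above that this section may change), the queuing and transfer portions of a copy run can be read from the activity run output. A minimal sketch, assuming the azure-mgmt-datafactory Python SDK and placeholder names; the exact keys inside executionDetails (for example detailedDurations, queuingDuration, transferDuration) are an assumption based on typical copy activity output and may differ:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

now = datetime.now(timezone.utc)
runs = client.activity_runs.query_by_pipeline_run(
    "<resource-group>",
    "<factory-name>",
    "<pipeline-run-id>",
    RunFilterParameters(last_updated_after=now - timedelta(days=7), last_updated_before=now),
).value

for run in runs:
    # run.output is assumed to deserialize to a dict for copy activities.
    if run.activity_type != "Copy" or not run.output:
        continue
    # executionDetails is a list with one entry per copy execution;
    # detailedDurations is an assumed key and may not be present in every run.
    for detail in run.output.get("executionDetails", []):
        durations = detail.get("detailedDurations", {})
        print(run.activity_name,
              "queued:", durations.get("queuingDuration"),
              "transfer:", durations.get("transferDuration"))
```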
I have several Copy activities in an Azure Data Factory pipeline, copying from an Azure SQL source to Azure Data Lake Store for different tables, independent of each other.
I have scheduled it for every 15 minutes. I am seeing a time lag of around 1 minute when triggering; for example, 12:00 AM jobs are triggered at 12:01 AM.
Also, only 2 copy activities get kicked off at a time out of the 20+ activities; the remaining ones are triggered one by one.
Is this expected behavior? Any ways to eradicate this time lag?
According to the SLA for Data Factory, the SLA for activity runs is within 4 minutes. A common practice is also to avoid the on-the-hour spike, especially at 12 AM (UTC).
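One way to step off the on-the-hour spike is to keep the 15-minute cadence but start it at an offset minute. A minimal sketch, assuming the azure-mgmt-datafactory Python SDK, placeholder resource names, and a hypothetical pipeline called MyCopyPipeline; my understanding is that minute-frequency schedule triggers derive their fire times from start_time, so starting at :07 gives runs at :07, :22, :37 and :52, but double-check that against the trigger documentation:

```python
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Every 15 minutes, anchored at :07 past the hour to avoid the :00/:15/:30/:45 spike.
recurrence = ScheduleTriggerRecurrence(
    frequency="Minute",
    interval=15,
    start_time=datetime(2024, 1, 1, 0, 7, tzinfo=timezone.utc),
    time_zone="UTC",
)

trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="MyCopyPipeline"))],
)

client.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "OffsetScheduleTrigger",
    TriggerResource(properties=trigger),
)
# The trigger still has to be started afterwards (e.g. client.triggers.begin_start(...)
# in recent SDK versions) before it fires.
```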
We would like to enable our customers to schedule recurring tasks on a daily, weekly and monthly basis. Linear scalability is really important to us, which is why we use Windows Azure Table Storage instead of SQL Azure. The current design is the following:
- Scheduling information is stored in a Table Storage table. For example: Task A, daily; Task B, weekly; ...
- There are worker processes, which run hourly and query this table. They then decide whether or not they have to run a given task.
But what if multiple worker roles start to run the same task?
Some other requirements:
- The worker processes can be in different time zones.
Windows Azure Queue Storage could solve all the concurrency problems mentioned above, but it also introduces some new issues:
- How many queue items should we generate?
- What if the customer changes the recurrence rate or revokes the scheduling?
So, my question is: how to design a recurring task scheduler with multiple asynchronous workers using Windows Azure Storage?
Perhaps the new Azure Scheduler service could help?
http://www.windowsazure.com/en-us/services/scheduler/
Some thoughts:
But what if multiple worker roles start to run the same task?
This could very well happen. To avoid this, you could have one worker role instance (any instance from the pool) read from the table and push messages into a queue. While this instance is doing this work, all other instances wait. To decide which instance does this work, you can make use of the blob lease functionality.
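A minimal sketch of that blob-lease election, assuming the azure-storage-blob Python SDK and a hypothetical marker blob named scheduler-lock: the instance that acquires the lease does the table scan and enqueues the work; the others skip this cycle.

```python
from azure.core.exceptions import HttpResponseError
from azure.storage.blob import BlobClient

def try_become_leader(conn_str: str) -> bool:
    """Attempt to acquire a 60-second lease on a marker blob.

    Container and blob names are assumptions for illustration; the blob
    just needs to exist (it can be empty).
    """
    blob = BlobClient.from_connection_string(conn_str, "locks", "scheduler-lock")
    try:
        blob.acquire_lease(lease_duration=60)  # raises if another instance holds the lease
        return True
    except HttpResponseError:
        return False

# if try_become_leader(conn_str):
#     read_schedule_table_and_enqueue_tasks()  # hypothetical helper
```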
Some other requirements: - The worker processes can be in different time zones.
Not sure about this. Assuming you're talking about Cloud Services Worker Roles, they could be in different data centers, but all of them will be in the UTC time zone.
How many queue items should we generate?
It really depends on how much work needs to be done. You could put all messages in a queue. A maximum of 32 messages can be dequeued from a queue by a client at a time, so if you have, say, 100 tasks and thus 100 messages, each instance can only read up to 32 messages from the queue in a single call to the queue service.
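A minimal sketch of that dequeue pattern, assuming the azure-storage-queue Python SDK and a hypothetical queue named scheduled-tasks; the 32-messages-per-call cap comes from the Queue service itself:

```python
from azure.storage.queue import QueueClient

def run_task(payload: str) -> None:
    """Hypothetical task runner -- replace with the real work."""
    print("processing", payload)

def drain_batch(conn_str: str) -> None:
    """Pull up to 32 messages in one service call and delete each after processing."""
    queue = QueueClient.from_connection_string(conn_str, "scheduled-tasks")
    # messages_per_page maps to the per-call limit of 32 imposed by the service;
    # visibility_timeout hides in-flight messages from other workers.
    for message in queue.receive_messages(
            messages_per_page=32, visibility_timeout=300, max_messages=32):
        run_task(message.content)
        queue.delete_message(message)  # remove only after the task completes
```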
What if the customer changes the recurrence rate or revokes the scheduling?
That should be OK, because once the task is completed you must remove the message from the queue. The next time the task is invoked, you can read from the table again, and it will give you the latest information about the task.
I would continue using Azure Table Storage, but mark the process as "in progress" before a worker starts working on it. Since ATS supports optimistic concurrency controlled by ETags, you can be assured that two workers won't be able to start the same task (see the sketch below).
I would, however, think about retry logic for when jobs fail unexpectedly, and have a process that restarts jobs that appear to have gone orphaned.
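A minimal sketch of the ETag-guarded "mark in progress" step described above, assuming the azure-data-tables Python SDK, a hypothetical Schedules table and a hypothetical Status column: only one worker's conditional update succeeds; the others get a 412 and skip the task.

```python
from azure.core import MatchConditions
from azure.core.exceptions import HttpResponseError
from azure.data.tables import TableClient, UpdateMode

def try_claim_task(conn_str: str, partition_key: str, row_key: str) -> bool:
    """Atomically flip a task to 'InProgress' using its ETag; return False if another worker won."""
    table = TableClient.from_connection_string(conn_str, table_name="Schedules")
    entity = table.get_entity(partition_key, row_key)
    if entity.get("Status") == "InProgress":
        return False
    entity["Status"] = "InProgress"
    try:
        table.update_entity(
            entity,
            mode=UpdateMode.MERGE,
            etag=entity.metadata["etag"],
            match_condition=MatchConditions.IfNotModified,  # fails with 412 if the row changed
        )
        return True
    except HttpResponseError:
        return False
```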