I have several Copy activities in an Azure Data Factory pipeline that copy different tables, independent of each other, from an Azure SQL source to Azure Data Lake Store.
I have scheduled the pipeline to run every 15 minutes. I am seeing a lag of around 1 minute at trigger time; for example, the 12:00 AM run actually starts at 12:01 AM.
Also, only 2 of the 20+ copy activities kick off at a time; the rest are triggered one by one.
Is this expected behavior? Is there any way to eliminate this lag?
According to the SLA for Data Factory, the Activity Runs SLA is within 4 minutes. A common practice is also to avoid the on-the-hour spike, especially at 12 AM (UTC).
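If you want to move the schedule off the top of the hour, one way is to give the schedule trigger a start time offset by a few minutes; the recurrence aligns to that start time. Below is a minimal sketch of such a trigger, written as a Python dict mirroring the JSON you would deploy; the trigger name CopyEvery15MinOffset and pipeline name CopyToDataLake are made up for illustration.

```python
import json

# Sketch of an ADF v2 schedule trigger that fires every 15 minutes,
# anchored at :07 past the hour instead of :00 to avoid the on-the-hour spike.
offset_trigger = {
    "name": "CopyEvery15MinOffset",  # hypothetical trigger name
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Minute",
                "interval": 15,
                # The recurrence is anchored to startTime, so runs land at
                # :07, :22, :37, :52 rather than exactly on the quarter hour.
                "startTime": "2019-01-01T00:07:00Z",
                "timeZone": "UTC",
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyToDataLake",  # hypothetical pipeline name
                    "type": "PipelineReference",
                }
            }
        ],
    },
}

if __name__ == "__main__":
    # Print the JSON you would paste into the trigger editor or deploy via ARM/REST.
    print(json.dumps(offset_trigger, indent=2))
```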
My scenario is this: I have two pipelines. Each of them has two data flows, and all four use the same integration runtime (IR).
I triggered pipeline number one. Its first data flow took around four minutes to start the cluster. As soon as that data flow finished, I triggered the second pipeline. However, its first data flow again took almost 5 minutes to acquire resources, even though I expected it to reuse the pool that was already available, the way the second data flow in the first pipeline did.
Is there something that I'm not considering?
I have a pipeline with a few copy activities. Some of those activities copy large amounts of data from a storage account back to the same storage account, but in compressed form (I'm talking about a few TB of data).
After running the pipeline for a few hours, I noticed that some activities show "Queue" time in the monitoring blade, and I was wondering what the reason for that "Queue" time might be. More importantly, am I being billed for that time as well? From what I understand, my ADF is not doing anything during it.
Can someone shed some light? :)
(Posting this as an answer because of the comment character limit.)
After a long discussion with Azure Support and reaching out to someone on the ADF product team, I got some answers:
1 - The queue time is not being billed.
2 - Initially, the ADF orchestration system puts the job in a queue, and it accrues "queue time" until the infrastructure picks it up and starts the processing part.
3 - In my case the queue time kept increasing after the job started because of a bug in the underlying backend executor (it uses Azure Batch). Apparently the executors were crashing and my job was suffering from "re-pickup" time, which increased the queue time. This explained why, after some time, I started to see the execution time and the amount of transferred data decrease. The ETA for this bug fix is the end of the month. Additionally, the job I was executing timed out (after 7 days), and after checking the billing I confirmed that I wasn't charged a dime for it.
Based on the chart in the ADF Monitor, you can find the same metrics in the example.
In fact, these are the metrics in the executionDetails parameter: Queue Time + Transfer Time = Duration Time.
More details on the stages copy activity goes through, and the corresponding steps, duration, used configurations, etc. It's not recommended to parse this section as it may change.
Please refer to Parallel Copy: the copy activity creates parallel tasks to transfer data internally. Activities remain in an active state during both queue time and transfer time; they never stop during queue time, so they are billed for the whole duration. I think this is an inevitable overhead of the data transfer process that is absorbed by ADF internally. You could try adjusting the parallelCopies parameter to see if anything changes.
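For reference, parallelCopies (and the related data integration units setting) go in the copy activity's typeProperties. Here is a rough sketch as a Python dict mirroring that JSON; the activity and dataset names are made up, and the exact source/sink types depend on your connectors:

```python
# Sketch of a copy activity with an explicit parallelCopies setting.
copy_activity = {
    "name": "CompressAndCopy",  # hypothetical activity name
    "type": "Copy",
    "inputs": [{"referenceName": "SourceBlobDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "SinkBlobDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "BlobSource"},
        "sink": {"type": "BlobSink"},
        "parallelCopies": 8,         # try different values to see the effect on throughput
        "dataIntegrationUnits": 16,  # also influences throughput and queueing behavior
    },
}

# After a run, the activity's monitoring output contains an executionDetails
# array; the queue vs. transfer split is reported there (typically under
# output["executionDetails"][0]["detailedDurations"] as "queuingDuration"
# and "transferDuration", in seconds).
```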
If you are concerned about the cost, you could submit feedback here to ask for a statement from the Azure team.
We have about 190 hourly usage files that need to arrive in the data lake over a 24-hour period before we can kick off our pipeline, which starts with an analytics activity. We have been running this pipeline on a schedule set to the estimated time by which we expect all the files to have arrived, but that doesn't always happen, so we then need to re-run the slices for the missing files.
Is there a more efficient way to handle this, so that instead of keeping the pipeline on a schedule, it is triggered by the event that all files have arrived in the data lake?
TIA for input!
You can add an event trigger that fires when a new blob is created (or deleted). We do this in production with a Logic App, but Data Factory V2 appears to support it now as well. The benefit is that you don't have to estimate the proper frequency; you can just execute when necessary.
NOTE: there is a limit to the number of concurrent pipelines you can have executing, so if you dropped all 190 files into blob storage at once, you may run into resource availability issues.
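For completeness, here is a rough sketch of what a Data Factory V2 blob event trigger definition can look like, written as a Python dict mirroring the JSON; the subscription/storage scope, the blob path, and the trigger and pipeline names are all placeholders:

```python
# Sketch of an ADF v2 blob event trigger that runs a pipeline whenever a
# new file lands under a given container/folder path.
blob_event_trigger = {
    "name": "UsageFileArrived",  # hypothetical trigger name
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
                     "Microsoft.Storage/storageAccounts/<storage-account>",
            "events": ["Microsoft.Storage.BlobCreated"],
            "blobPathBeginsWith": "/usage/blobs/hourly/",  # placeholder path
            "ignoreEmptyBlobs": True,
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "ProcessUsageFiles",  # hypothetical pipeline name
                    "type": "PipelineReference",
                }
            }
        ],
    },
}
```

Keep in mind that such a trigger fires once per created blob, so to wait for all 190 files you would still need some gating logic in the pipeline itself (for example, counting the arrived files with a Get Metadata activity before running the heavy analytics step).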
I have scheduled the Azure ML Batch Job via Azure Data Factory to run daily at 12:00 AM UTC.
I don't know what the issue is, but it fails on the 3rd day of every month; otherwise it runs perfectly.
Is anybody facing the same issue?
(Error screenshots for September and October were attached.)
It looks like ADF is successfully invoking ML and reporting back the "not converging" error. Could there be something specific in the input data that causes this problem? Is there anything in the ML model that handles dates monthly and could be affected by daily execution around the start/end of the month (especially if there is any data offset or delay)?
It is likely data related. The error is being returned from the batch execution system when the model tries to score the data. I would look for duplicate IDs being inserted, or any specific data being passed that could cause problems for this model.
Having the job ID and the Azure region this is running in would help us look up the specific error.
I have created a web application using the sails.js framework and MongoDB for the back end. I want to implement a cron job that runs at midnight each day so that I can calculate the previous day's power consumption for a property. This data will be used for reporting purposes.
The cron job has to run with respect to each property's location, i.e., if two properties are located in different time zones, I have to run the job when midnight occurs in each time zone.
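Not specific to sails.js, but the usual pattern is either one scheduled job per time zone, or a single frequent job that checks which properties have just hit local midnight. Here is a rough sketch of the second approach in Python (the property list and function names are made up); in a sails.js app the same idea can be wired up with a Node scheduler such as node-cron, which supports a per-task timezone option.

```python
# Sketch: run this job once per hour (e.g. at minute 0); for each property,
# fire the previous-day rollup only when its local time has just passed midnight.
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+

# Placeholder data; in the real app these records would come from MongoDB.
PROPERTIES = [
    {"id": "prop-1", "timezone": "America/New_York"},
    {"id": "prop-2", "timezone": "Asia/Kolkata"},
]

def rollup_previous_day(prop_id: str, local_day: date) -> None:
    """Placeholder for the power-consumption aggregation for one property."""
    print(f"aggregating {prop_id} for {local_day}")

def hourly_tick(now_utc: datetime) -> None:
    """Called by the scheduler every hour; runs the rollup at each local midnight."""
    for prop in PROPERTIES:
        local_now = now_utc.astimezone(ZoneInfo(prop["timezone"]))
        if local_now.hour == 0:  # local midnight just passed for this property
            previous_day = (local_now - timedelta(days=1)).date()
            rollup_previous_day(prop["id"], previous_day)

if __name__ == "__main__":
    hourly_tick(datetime.now(ZoneInfo("UTC")))
```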