Strange runtime for data flows using the same Azure IR - azure

My scenario is this: I have two pipelines. Each of them has two data flows, and all four use the same IR.
I triggered pipeline number one. Its first data flow took around four minutes to start the cluster. As soon as this data flow finished, I triggered the second pipeline. However, its first data flow once again took almost 5 minutes to acquire resources. I was expecting it to use the pool that was already available, as the second data flow in the first pipeline did.
Is there something that I'm not considering?
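If the goal is a warm pool that later data flows can reuse, the time-to-live (TTL) on the integration runtime's data flow compute is the setting to look at: without a TTL, a data flow can pay the full cluster start-up cost. As a minimal sketch, assuming the azure-mgmt-datafactory SDK (the subscription, resource group, factory, and IR names are placeholders):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeResource,
        ManagedIntegrationRuntime,
        IntegrationRuntimeComputeProperties,
        IntegrationRuntimeDataFlowProperties,
    )

    client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Managed (Azure) IR whose data flow cluster stays warm for 10 minutes
    # after a run, so a follow-up data flow can reuse it instead of cold-starting.
    ir = IntegrationRuntimeResource(
        properties=ManagedIntegrationRuntime(
            compute_properties=IntegrationRuntimeComputeProperties(
                data_flow_properties=IntegrationRuntimeDataFlowProperties(
                    compute_type="General",
                    core_count=8,
                    time_to_live=10,  # minutes the warm pool survives after a run
                )
            )
        )
    )
    client.integration_runtimes.create_or_update(
        "<resource-group>", "<factory-name>", "<ir-name>", ir
    )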

Related

ADF Reduce Queue Time for Copy Activity

I am trying to copy a backup file that is provided from an HTTP source. The download URL is only valid for 60 seconds, and the Copy step is timing out before it can complete (timeout set to 1 minute 1 second). It completes on occasion but is very inconsistent. When it completes, the step queues for around 40 seconds; other times it is queued for over a minute and the link has expired by the time it eventually gets to downloading the file. It is one zipped JSON file being downloaded, less than 100 KB.
Both Source and Sink datasets use a Managed VNet IR we have created (it must be used on the Sink due to company policy); using the AutoResolve IR on the Source, it takes even longer queueing.
I've tried all the variations of 'Max concurrent connections', 'DIU' and 'Degree of copy parallelism' in the Copy activity I can think of, and none seems to have any effect. It appears to be random whether the queue time is short enough for the download to succeed.
Is there any way to speed up the queue process to try and get more consistent successful downloads?
Both Source and Sink datasets use a Managed VNet IR we have created (it must be used on the Sink due to company policy); using the AutoResolve IR on the Source, it takes even longer queueing.
This is pretty confusing. If your Source and Sink are in a Managed VNet, your IR should also be in the same Managed VNet for better security and performance.
As per this official document:
By design, Managed VNet IR takes longer queue time than Azure IR as we are not reserving one compute node per service instance, so there is a warm up for each copy activity to start, and it occurs primarily on VNet join rather than Azure IR.
Since there is no reserved node, there is no way to speed up the queue process.

How to optimise a luigi pipeline?

I have a pipeline built with luigi. One luigi task downloads data from an external service, based on a txt file listing what to fetch. As there are over 3,000 requests to the external service, the pipeline often fails because of how long the task takes to finish.
What could be done to improve the scalability of the pipeline and to make sure it doesn't fail? Threading? Multiprocessing? What solution is optimal for making sure the pipeline doesn't fail on big tasks that take a long time?
I didn't provide a code example because I need a general approach, not an example-based one.
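Since the work is I/O-bound (thousands of small HTTP requests), threads inside a single task usually go a long way; multiprocessing buys little here. A minimal sketch, assuming the items are plain HTTP GETs (the file names, worker count, and retry policy are illustrative):

    import concurrent.futures

    import luigi
    import requests

    class FetchAll(luigi.Task):
        """Download every item listed in the input txt file, many at a time."""
        input_list = luigi.Parameter(default="items.txt")  # one URL per line

        def output(self):
            return luigi.LocalTarget("downloads/done.txt")

        def fetch_one(self, url):
            # Retry so one flaky request doesn't fail the whole task.
            for attempt in range(3):
                try:
                    resp = requests.get(url, timeout=30)
                    resp.raise_for_status()
                    return url, resp.content
                except requests.RequestException:
                    if attempt == 2:
                        raise

        def run(self):
            with open(self.input_list) as f:
                urls = [line.strip() for line in f if line.strip()]
            # I/O-bound work: threads are enough, no multiprocessing needed.
            with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
                results = list(pool.map(self.fetch_one, urls))
            with self.output().open("w") as out:
                for url, _ in results:
                    out.write(url + "\n")

Splitting the input into chunks, one luigi task per chunk, also limits the blast radius: a failure then only re-runs the failed chunk instead of all 3,000 requests.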

How does the ForEach activity declare success in Data Factory?

In the following scenario we have a ForEach activity running in an Azure Data Factory pipeline to copy data from source to destination.
The last Copy activity took 4:10:33, but the ForEach activity declared Succeeded 36 minutes later, at 4:46:12.
The question is: why does the ForEach activity need these extra 36 minutes?
Does the ForEach also need to consolidate results from its sub-activities before declaring success or failure?
Official answer from Microsoft: the ForEach activity does wait for all inner activity runs to complete. In theory, there should not be much delay in marking the ForEach run successful after the last activity run within it succeeds. However, ADF relies on a partner service to execute the runs, and it's possible that the partner service runs into failures and cannot complete the ForEach in time. There is built-in logic to retry and recover, but the behavior visible in ADF activity runs is a delay. It's also possible that the orchestration service fails and the partner service keeps retrying its calls, but usually the partner-service delay is the main cause here.
Our assumption: the duration is end-to-end for the pipeline activity. It takes into account all factors, like marshaling of your data flow script from ADF to the Spark cluster, cluster acquisition time, job execution, and I/O write time. Because ADF is serverless compute, we think the ForEach needs time to wait for all activities to acquire and release computing resources, but this is a guess; there are few official explanations.
So there will be a delay, which varies according to the inner activities.
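One way to check where the time goes is to compare the end times of the inner activity runs against the end of the ForEach itself. A sketch of that diagnostic, assuming the azure-mgmt-datafactory SDK (subscription, resource group, factory name, and run ID are placeholders):

    from datetime import datetime, timedelta

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import RunFilterParameters

    client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # List every activity run belonging to one pipeline run.
    runs = client.activity_runs.query_by_pipeline_run(
        "<resource-group>", "<factory-name>", "<pipeline-run-id>",
        RunFilterParameters(
            last_updated_after=datetime.utcnow() - timedelta(days=1),
            last_updated_before=datetime.utcnow(),
        ),
    )
    # The gap between the last inner activity's end time and the
    # ForEach's own end time is the delay discussed above.
    for run in runs.value:
        print(run.activity_name, run.status, run.activity_run_end)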

Azure Pipelines: How to block pipeline A if pipeline B is running

I have two pipelines (also called "build definitions") in Azure Pipelines; one executes system tests and one executes performance tests. Both use the same test environment. I have to make sure that the performance pipeline is not triggered while the system test pipeline is running, and vice versa.
What I've tried so far: I can use the Azure DevOps REST API to check whether a build is running for a certain definition. So I could implement a job that executes a script before the actual pipeline runs. The script would poll the other pipeline's build status via the REST API every second and time out after, e.g., 1 hour.
However, this seems quite hacky to me. Is there a better way to block a build pipeline while another one is running?
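For reference, the polling workaround described above could look like the following minimal sketch, assuming a personal access token with Build (read) scope; the organization, project, and definition ID are placeholders:

    import time

    import requests

    ORG, PROJECT = "my-org", "my-project"     # placeholders
    OTHER_DEFINITION_ID = 42                  # build definition to wait on
    PAT = "<personal-access-token>"

    # Ask for builds of the other definition that are queued or running.
    url = (f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/build/builds"
           f"?definitions={OTHER_DEFINITION_ID}"
           f"&statusFilter=inProgress,notStarted&api-version=6.0")

    deadline = time.time() + 3600             # give up after 1 hour
    while time.time() < deadline:
        resp = requests.get(url, auth=("", PAT))
        resp.raise_for_status()
        if resp.json()["count"] == 0:         # nothing queued or running
            break                             # safe to proceed
        time.sleep(15)                        # poll politely, not every second
    else:
        raise TimeoutError("other pipeline still running after 1 hour")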
If your project is private, the Microsoft-hosted CI/CD parallel job limit is one free parallel job that can run for up to 60 minutes each time, until you've used 1,800 minutes (30 hours) per month.
The self-hosted CI/CD parallel job limit is one self-hosted parallel job. Additionally, for each active Visual Studio Enterprise subscriber who is a member of your organization, you get one additional self-hosted parallel job.
Currently, there is no setting to control the parallel job limit per agent pool. However, there is a similar problem on the community forum, and an answer has been marked as accepted; I recommend checking whether it helps you. Here is the link.

Make the Azure Batch job schedule not wait on the previous iteration

I have an Azure Batch service set up with a job schedule that runs every minute.
The job manager task creates 3-10 tasks within the same job.
Sometimes one of these tasks may take extremely long to complete, though they are usually very fast.
When one of the tasks takes a long time, the next iteration of the job manager task does not begin; it waits until all the tasks from the previous iteration have completed.
Is there a way to ensure that the job schedule keeps creating a version of the job every minute, even if the tasks from its previous iteration have not all completed?
I know one option is to make the job manager task create additional jobs instead of tasks. But preferably, I was hoping for some configuration at the job schedule level that I can turn on, which lets the schedule create the next job without depending on completion of the previous one.
This seems more like a design question. AFAIK, no: duplicate active jobs under a schedule are not doable from the Azure Batch perspective. (I stand corrected if this is somehow doable.)
To think this through further, you can read the design recommendations in the Azure Batch technical overview page, or posts like:
How to use Azure Batch in an event based design and terminate/cleanup finished jobs or
Add Tasks to a running Azure batch job and manually control termination
I think simplicity will serve better, e.g. handling each iteration with a unique job name or something similar, but you will know your scenario best. Hope this helps.
Currently, a Job Schedule can have at most one active Job under it at any given time (link), so the behaviour you're seeing is expected.
We don't have any simple feature you can just "turn on" to achieve concurrent jobs from a single job schedule - but I do have a suggestion:
Instead of using the JobSchedule to run all the processing directly, use it to create "worker" jobs that do the processing.
E.g.
At 10:03 am, your job schedule triggers to create job processing-20191031-1003.
At 10:04 am, your job schedule triggers to create job processing-20191031-1004.
At 10:05 am, your job schedule triggers to create job processing-20191031-1005.
and so on
Because the only thing your job schedule does is create another job, it will finish very quickly, ensuring the next job is created on time.
Since your existing jobs already create a variable number of tasks (you said 3-10 tasks, above), I'm hoping this won't be a very complex change for your code.
Note that you will need to ensure your concurrent worker jobs don't step on each other's toes by trying to do the same work multiple times.
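A minimal sketch of that pattern, assuming the azure-batch Python SDK; the account name, key, URL, and pool ID are placeholders:

    import datetime

    import azure.batch.models as batchmodels
    from azure.batch import BatchServiceClient
    from azure.batch.batch_auth import SharedKeyCredentials

    # Placeholder credentials; a real job manager task would read these
    # from configuration or the environment.
    creds = SharedKeyCredentials("<account-name>", "<account-key>")
    client = BatchServiceClient(
        creds, batch_url="https://<account-name>.<region>.batch.azure.com"
    )

    # The job manager only creates a uniquely named worker job and exits,
    # so the schedule's own job finishes in seconds and the next iteration
    # is never blocked.
    job_id = datetime.datetime.utcnow().strftime("processing-%Y%m%d-%H%M")
    client.job.add(batchmodels.JobAddParameter(
        id=job_id,                            # e.g. processing-20191031-1003
        pool_info=batchmodels.PoolInformation(pool_id="<worker-pool>"),
    ))

    # The 3-10 real tasks then go into the worker job instead:
    client.task.add(job_id, batchmodels.TaskAddParameter(
        id="task-1",
        command_line="/bin/bash -c 'echo do the real work here'",
    ))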
