Foreach activity in parallel mode running multiple times - Azure

In my ADF pipeline I am executing a ForEach activity that runs child pipelines, iterating over the output of a Lookup activity.
The ForEach activity is supposed to run in parallel (Sequential execution is not checked). Ideally, if my Lookup returns 10 rows, the ForEach should run 10 parallel iterations in total.
However, I can see that the same 10 parallel runs have executed 4 times, so each child pipeline ended up being executed 4 times instead of once.
Note: I started the pipeline via a trigger, and there is no Set Variable activity inside the ForEach that is used in the child pipelines. The attributes are passed as @item().
Are there any properties I should set to eliminate these multiple runs?
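For reference, a minimal sketch of the ForEach definition being described, in pipeline JSON (activity, pipeline and column names here are made up); with Sequential unchecked, fan-out is controlled by batchCount (default 20, maximum 50):

{
    "name": "ForEach1",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "batchCount": 10,
        "items": {
            "value": "@activity('Lookup1').output.value",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Execute child pipeline",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": { "referenceName": "ChildPipeline", "type": "PipelineReference" },
                    "waitOnCompletion": true,
                    "parameters": {
                        "someAttribute": { "value": "@item().someColumn", "type": "Expression" }
                    }
                }
            }
        ]
    }
}

Nothing in these properties repeats an iteration, so several triggers attached to the same pipeline, or trigger reruns, are the usual suspects to rule out for duplicate runs.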

Related

Data Factory - Foreach activity: run in parallel but sequentially

I'm creating an ADF pipeline and I'm using a ForEach activity to run multiple Databricks notebooks.
My problem is that two notebooks have dependencies on each other.
That is, one notebook has to run before the other, because of the dependency. I know that the ForEach activity can be executed sequentially or in batches, but the problem is that when running sequentially it processes items one by one, and since I have many partitions it will take a long time.
What I want is to run sequentially but by batch. In other words, I have a notebook that runs with ES, UK and DK partitions; I want those partitions of that notebook to run in parallel, wait for the notebook to finish completely, and only then start running the other notebook with the same partitions. If I run it in batch mode, it doesn't wait for full execution; it starts running the other notebook at an arbitrary point.
I get the order of the notebooks from a config table, in which I specify the order in which they should run; I then have a notebook that builds my final JSON with that order.
Config Table:

sPath   TableSource   TableDest    order
path1   dbo.table1    dbo.table1   1
path2   dbo.table2    dbo.table2   2
This is my pipeline:
I wanted the execution to be sequential but batched; however, it is not possible to select Sequential and a Batch count at the same time.
Can anyone please help me in achieving this?
Thank you!
I have tried to reproduce this. To run a ForEach sequentially and batchwise at the same time, we need two pipelines, one nested inside the other: the outer pipeline provides the sequential run and the inner pipeline the batched run. Below are the steps.
I took a sample config table as in the image below.
Pipeline1 is the outer pipeline. In it, a Lookup activity selects only the distinct sortorder values in increasing order; each sortorder value is passed to the child pipeline sequentially as a parameter:
select distinct sortorder from config_table order by sortorder
A ForEach activity is added after the Lookup activity. We use this for the sequential run, so Sequential is checked, and the output of the Lookup activity is given in the Items box:
@activity('Lookup1').output.value
Inside the ForEach activity, pipeline2 is invoked with an Execute Pipeline activity. A pipeline parameter pp_sortorder is added to the child pipeline pipeline2.
In pipeline2, a Lookup activity is added with a dataset referring to the config table, filtered by the sortorder value received from pipeline1:
select * from config_table where sortorder=@{pipeline().parameters.pp_sortorder}
Next to the Lookup, a ForEach is added; the Lookup activity output is given in Items and a Batch count of 5 is set here (the batch count can be increased as per requirement).
A Stored Procedure activity is added inside the ForEach activity for checking the parallel processing.
After setting all this up, pipeline1 is executed. The Execute Pipeline activity of pipeline1 runs sequentially, while the Stored Procedure activity of pipeline2 runs simultaneously.
pipeline1 output status: the second Execute Pipeline run is started only once the first one has ended.
pipeline2 output status: all Stored Procedure activities have started to execute simultaneously.
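Put together, the relevant parts of the two ForEach definitions look roughly like this (a sketch following the steps above; the inner Lookup name Lookup2 is assumed).

pipeline1 (outer ForEach, sequential):

{
    "type": "ForEach",
    "typeProperties": {
        "isSequential": true,
        "items": { "value": "@activity('Lookup1').output.value", "type": "Expression" },
        "activities": [
            {
                "name": "Execute Pipeline1",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": { "referenceName": "pipeline2", "type": "PipelineReference" },
                    "waitOnCompletion": true,
                    "parameters": {
                        "pp_sortorder": { "value": "@item().sortorder", "type": "Expression" }
                    }
                }
            }
        ]
    }
}

pipeline2 (inner ForEach, batched):

{
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "batchCount": 5,
        "items": { "value": "@activity('Lookup2').output.value", "type": "Expression" },
        "activities": [
            { "name": "Stored procedure1", "type": "SqlServerStoredProcedure", "typeProperties": { } }
        ]
    }
}

waitOnCompletion is the important switch: without it the outer loop would start the next sortorder before the current batch has finished.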

ADF costing for parallel running pipelines

I have a main ADF pipeline which has several child pipelines. Those pipelines get data from different sources in Azure Blob storage and load it into different Snowflake tables.
Individually, each child pipeline runs for an average of 4 minutes. However, they run in parallel under the main pipeline, which runs for around 8 minutes. If I sum up the execution time of each child pipeline, it totals about 40 minutes.
So will I be charged for 8 minutes of parallel execution, or for 40 minutes based on the total of all child pipeline runs?
I have already checked cost analysis, and it does not give costs per individual pipeline.
I have validated the scenario asked in the question. It seems ADF charges for each pipeline run. I recreated the scenario as below:
Created one main pipeline and ran 3 child pipelines in parallel within the main pipeline for a couple of days. The main pipeline ran for 17 minutes.
Removed 2 child pipelines and ran the main pipeline with 1 child pipeline for a couple of days. The main pipeline again ran for 17 minutes.
The 2nd scenario cost around 1.490376279 INR per day, while scenario 1 cost 8.141834833 INR. This validates that ADF charges for each pipeline run, irrespective of how the pipelines run.

How to limit/set parallelism of QueueTriggers that get executed with Azure WebJobs

I have 5 QueueTrigger jobs within a single Function.cs file. 3 jobs must execute sequentially (synchronously) and 2 can process up to 16 items at a time.
From what I can tell from the documentation, the AddAzureStorage queue configuration method only supports setting this parallelism globally, for all the jobs:
.AddAzureStorage(queueConfig =>
{
    queueConfig.BatchSize = 1;
});
The above sets all jobs to process only one item at a time. If I set it to 16, then all jobs will run in parallel, which is not what I want either.
Is there a way to set the BatchSize per QueueTrigger WebJob, or will I have to set it to 16 and use locks on the jobs I don't want to run in parallel to achieve the desired behaviour?
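If no per-trigger setting is available, the lock-based workaround mentioned in the question could look roughly like this: a sketch assuming hypothetical queue names, with the global BatchSize set to 16. Note that a SemaphoreSlim only serializes within a single host instance; the SDK's [Singleton] attribute is worth a look if the host scales out.

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public class Functions
{
    // Shared gate for the jobs that must not run in parallel.
    private static readonly SemaphoreSlim SerialGate = new SemaphoreSlim(1, 1);

    // BatchSize is 16 globally, but this job admits one message at a time.
    public static async Task ProcessOrderedAsync(
        [QueueTrigger("ordered-queue")] string message)
    {
        await SerialGate.WaitAsync();
        try
        {
            // ... handle the message ...
        }
        finally
        {
            SerialGate.Release();
        }
    }

    // This job is free to use the full batch size of 16.
    public static Task ProcessParallelAsync(
        [QueueTrigger("bulk-queue")] string message)
    {
        // ... handle messages concurrently ...
        return Task.CompletedTask;
    }
}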

aggregate logstash filter with "multiple pipelines"

I would like to have httpd access_log entries processed by two different Logstash filters.
One of them is the aggregate filter, which is known to only work properly with a single worker thread. The other filter (let's call it "otherfilter") should be allowed to work with several worker threads, so that there is no loss of performance.
To accomplish this I would like to use the multiple pipelines feature of Logstash. Basically, one pipeline should read the data (the "input pipeline") and distribute it to two other pipelines on which the two mentioned filters operate (let's call them the "aggregate pipeline" and the "otherfilter pipeline").
First tests have shown that the results of the aggregate filter are not correct if the input pipeline is set up to work with more than one thread. That is, when aggregating over 60-second intervals, an event counter sometimes shows more and sometimes fewer events than actually occurred. The problem seems to be that events arrive unordered at the aggregate filter, and thus the intervals (whose start and end are determined from the timestamp field) are incorrect.
So I ask myself whether what I want to achieve is feasible at all with multiple pipelines.
You can break up a single pipeline into multiple pipelines, but since you want to use the aggregate filter you need to make sure that everything that happens before an event enters the aggregate filter runs with only one worker.
For example, suppose you broke your pipeline up into pipeline A, which is your input, pipeline B, which is your aggregate filter, and pipeline C, which is your other filter.
This will only work if:
Pipeline A is running with only one worker.
Pipeline B is running with only one worker.
Pipeline C runs after pipeline B and doesn't rely on the order of the events.
If your input pipeline runs with more than one worker, you can't guarantee the order of the events when they enter your aggregate pipeline; so basically your input and your aggregate should be in the same pipeline, and you can then direct the output to the other-filter pipeline, which can run with more than one worker, as sketched below.
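A sketch of that layout, assuming made-up pipeline ids and paths: the input and the aggregate filter share one single-worker pipeline, which forwards events to a multi-worker pipeline via pipeline-to-pipeline communication.

# pipelines.yml
- pipeline.id: input_and_aggregate
  pipeline.workers: 1
  path.config: "/etc/logstash/conf.d/input_and_aggregate.conf"
- pipeline.id: otherfilter
  pipeline.workers: 4
  path.config: "/etc/logstash/conf.d/otherfilter.conf"

# input_and_aggregate.conf: input and aggregate filter, then forward
output {
  pipeline { send_to => ["otherfilter"] }
}

# otherfilter.conf: receive the events and apply the other filter
input {
  pipeline { address => "otherfilter" }
}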

Azure data factory activity execute after all other copy data activities have completed

I've got an Azure Data Factory V2 pipeline with multiple Copy Data activities that run in parallel.
I have a "Pause DW" webhook activity that pauses an Azure data warehouse after each run. This activity is set to run after the completion of one of the longest-running activities in the pipeline. The pipeline is triggered nightly.
Unfortunately, the time taken to run the Copy Data activities varies because it depends on the transactions that have been processed in the business, which varies each day. This means I can't predict which of the activities that run in parallel will finish last, and often the whole pipeline fails because the DW has been paused before some of the activities have started.
What's the best way of running an activity only after all other activities in the pipeline have completed?
I have tried to add an If Condition activity to the pipeline like this:
However, I then run into this error during validation:
If Condition1
The output of activity 'Copy small tables' can't be referenced since it has no output.
Does anyone have any idea how I can move this forwards?
thanks
Just orchestrate all your parallel activities towards the Pause DWH activity, making it depend on every one of them; then it will be executed only after all your activities have completed, as in the sketch below.
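In pipeline JSON terms that means giving the pause activity a dependsOn entry for every parallel copy activity, roughly like this (activity names other than 'Copy small tables' are made up):

{
    "name": "Pause DW",
    "type": "WebHook",
    "dependsOn": [
        { "activity": "Copy small tables", "dependencyConditions": [ "Succeeded" ] },
        { "activity": "Copy medium tables", "dependencyConditions": [ "Succeeded" ] },
        { "activity": "Copy large tables", "dependencyConditions": [ "Succeeded" ] }
    ]
}

Using "Completed" instead of "Succeeded" would pause the DW even when one of the copies fails.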
I think you can use the Execute Pipeline activity.
Let the trigger point to a new pipeline which has an Execute Pipeline activity pointing to the current pipeline with the copy activities, and be sure to select the option Advanced -> Wait for completion. Once the executed pipeline is done, the flow moves on to the webhook activity, which has the logic to pause the DW.
Let me know how this goes.
