Azure Data Factory - ForEach activity: run in parallel but sequentially

I'm creating an ADF pipeline and I'm using a ForEach activity to run multiple Databricks notebooks.
My problem is that two notebooks have dependencies on each other.
That is, one notebook has to run before the other, because of that dependency. I know that the ForEach activity can be executed sequentially or in batch, but the problem is that when running sequentially it runs the items one by one, and since I have many partitions it takes a long time.
What I want is to run sequentially but in batches. In other words, I have a notebook that runs for the ES, UK and DK partitions, and I want those partitions of that notebook to run in parallel, wait for the notebook to finish completely, and only then start running the next notebook for the same partitions. If I set it to batch, it doesn't wait for the full execution; it starts running the other notebook in a random order.
I get the order of the notebooks from a config table, in which I specify the order in which they should run, and then I have a notebook that builds my final JSON with that order.
Config Table:
sPath    TableSource    TableDest     order
path1    dbo.table1     dbo.table1    1
path2    dbo.table2     dbo.table2    2
This is my pipeline:
The execution I want is both by batch and sequential, but it is not possible to select Sequential and a Batch count at the same time.
Can anyone please help me in achieving this?
Thank you!

I have tried to reproduce this. To run ForEach both sequentially and batchwise, we need two pipelines, one nested inside the other. The outer pipeline handles the sequential run and the inner pipeline handles the batchwise run. Below are the steps.
I took a sample config table as in the image below.
Pipeline 1 is the outer pipeline. In it, a Lookup activity is used to select only the sortorder field values in increasing order. The sortorder value is passed as a parameter to the child pipeline, sequentially.
select distinct sortorder from config_table order by sortorder
A ForEach activity is added after the Lookup activity; this is what gives us the sequential run. Sequential is checked, and in the Items box the output of the Lookup activity is given:
@activity('Lookup1').output.value
Inside the ForEach activity, pipeline2 is invoked with an Execute Pipeline activity. A pipeline parameter pp_sortorder is added to the child pipeline pipeline2.
In pipeline2, a Lookup activity is added with a dataset referring to the config table, filtered on the sortorder value received from pipeline1:
select * from config_table where sortorder=@{pipeline().parameters.pp_sortorder}
Next to the Lookup, a ForEach is added; in Items the Lookup activity output is given, and a Batch count of 5 is set here (the Batch count can be increased as required).
A Stored Procedure activity is added inside this ForEach activity to check the parallel processing.
After setting all this up, Pipeline 1 is executed. The Execute Pipeline activities of pipeline1 run sequentially and the Stored Procedure activities of pipeline2 run simultaneously.
pipeline1 output status:
The second Execute Pipeline activity starts only once the first one has finished.
pipeline2 output status:
All Stored Procedure activities started executing simultaneously.
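For reference, here is a minimal JSON sketch of the two key ForEach definitions (the activity, dataset and linked service names are placeholders rather than the exact ones above). The important settings are isSequential and waitOnCompletion in the outer pipeline and batchCount in the inner one.

Outer pipeline (pipeline1):

{
  "name": "ForEach_sortorder",
  "type": "ForEach",
  "dependsOn": [ { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] } ],
  "typeProperties": {
    "isSequential": true,
    "items": { "value": "@activity('Lookup1').output.value", "type": "Expression" },
    "activities": [
      {
        "name": "Execute pipeline2",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "pipeline2", "type": "PipelineReference" },
          "waitOnCompletion": true,
          "parameters": { "pp_sortorder": { "value": "@item().sortorder", "type": "Expression" } }
        }
      }
    ]
  }
}

Inner pipeline (pipeline2):

{
  "name": "ForEach_tables",
  "type": "ForEach",
  "dependsOn": [ { "activity": "Lookup2", "dependencyConditions": [ "Succeeded" ] } ],
  "typeProperties": {
    "isSequential": false,
    "batchCount": 5,
    "items": { "value": "@activity('Lookup2').output.value", "type": "Expression" },
    "activities": [
      {
        "name": "StoredProcedure1",
        "type": "SqlServerStoredProcedure",
        "linkedServiceName": { "referenceName": "AzureSqlDatabase1", "type": "LinkedServiceReference" },
        "typeProperties": { "storedProcedureName": "dbo.usp_test_parallel" }
      }
    ]
  }
}

waitOnCompletion must stay true on the Execute Pipeline activity; otherwise the outer sequential loop would move on to the next sortorder before the child pipeline has finished.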

Related

Foreach activity in parallel mode running multiple times

In my ADF pipeline I am executing a ForEach activity, fed from a Lookup, that contains child pipelines.
The ForEach activity is supposed to run in parallel (Sequential execution is not checked). Ideally, if my Lookup outputs 10 rows, the ForEach should have produced a total of 10 parallel runs.
However, I could see that the same 10 parallel runs ran 4 times, so each child pipeline ended up being executed 4 times instead of once.
Note: I started the run via a trigger, and there is no Set Variable inside the ForEach activity that is used in the child pipeline. The attributes are passed as @item().
Are there some properties I should set to eliminate these multiple runs?

Am I using the switch/case step wrong here to control the flow?

I am trying to check the logs and, depending on the last log entry, run a different step in the transformation. Am I supposed to use some other steps, or am I making another mistake here?
For example, if the query returns 1 I want Execute SQL script to run, for 2 I want Execute SQL script 2 to run, and for 3 I want the transformation to abort. But it keeps running all the steps even though only one value is returned from the CONTROL step.
The transformation looks like this
And the switch/case step looks like this
It looks like it's correctly configured, but keep in mind that in a transformation all steps are initialized at the start of the transformation and wait to receive streamed data from the previous step. So the Abort and Execute SQL script steps are started as soon as the transformation starts; if they don't need data from a previous step to run, they run right at the beginning.
If you want the scripts to be executed depending on the result of the CONTROL output, you'll need to use a job, which runs its steps (actions) sequentially:
A transformation runs the CONTROL step, and afterwards you add a "Copy rows to result" step to make the data produced by the CONTROL step available to the job.
After the transformation, you use a "Simple evaluation" action in the job to determine which script to run (or whether to abort). Jobs also have an "Execute SQL script" action, so you can put it afterwards.
I'm assuming your CONTROL step only produces one row; if the output is more than one row, the "Simple evaluation" action won't do the job, and you'll have to design one or more transformations to execute for each row of the previous transformation, running what you need.

aggregate logstash filter with "multiple pipelines"

I would like httpd access_log entries to be processed by two different Logstash filters.
One of them is the "aggregate" filter, which is known to only work properly with a single worker thread. However, the other filter (let's call it "otherfilter") should be allowed to work with several worker threads, so that there is no loss of performance.
To accomplish this I would like to use the "multiple pipeline" feature of logstash. Basically one pipeline should read the data ("input pipeline") and distribute it to two other pipelines on which the two mentioned filters operate (let's call them "aggregate pipeline" and "otherfilter pipeline").
First tests have shown that the results of the aggregate filter are not correct if the input pipeline is set up to work with more than one thread. That is, when aggregating over 60-second intervals, an event counter sometimes shows more and sometimes fewer events than actually occurred. The problem seems to be that events arrive out of order at the aggregate filter, and thus the intervals (whose start and end are determined from the timestamp field) are incorrect.
So I am wondering whether what I want to achieve is feasible at all with "multiple pipelines"?
You can break up a single pipeline into multiple pipelines, but since you want to use the aggregate filter you need to make sure that everything that happens before an event enters the aggregate filter runs with only one worker.
For example, say you break your pipeline up into pipeline A, which is your input, pipeline B, which is your aggregate filter, and pipeline C, which is your other filter.
This will only work if:
Pipeline A is running with only one worker.
Pipeline B is running with only one worker.
Pipeline C runs after pipeline B and does not rely on the order of the events.
If your input pipeline is running with more than one worker, you can't guarantee the order of the events when they enter your aggregate pipeline. So basically your input and your aggregate should be in the same pipeline, and then you can direct the output to the other filter pipeline that runs with more than one worker.
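A minimal pipelines.yml sketch of that layout, assuming a standard multiple-pipelines setup (the pipeline IDs and config paths are made up):

# pipelines.yml
- pipeline.id: input_and_aggregate
  pipeline.workers: 1        # single worker, so events reach the aggregate filter in order
  path.config: "/etc/logstash/conf.d/input_and_aggregate.conf"
- pipeline.id: otherfilter
  pipeline.workers: 4        # free to scale, since this filter does not depend on event order
  path.config: "/etc/logstash/conf.d/otherfilter.conf"

The first pipeline can then forward events with the pipeline output plugin (output { pipeline { send_to => ["otherfilter"] } }) and the second pipeline receives them with a matching pipeline input (input { pipeline { address => "otherfilter" } }).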

Azure Data Factory: execute an activity after all other Copy Data activities have completed

I've got an Azure Data Factory V2 pipeline with multiple Copy Data activities that run in parallel.
I have a Pause DW webhook activity that pauses an Azure data warehouse after each run. This activity is set to run after the completion of one of the longest-running activities in the pipeline. The pipeline is set to trigger nightly.
Unfortunately, the time taken to run the Copy Data activities varies because it depends on the transactions that have been processed in the business, which vary each day. This means I can't predict which of the activities that run in parallel will finish last, so often the whole pipeline fails because the DW has been paused before some of the activities have started.
What's the best way of running an activity only after all other activities in the pipeline have completed?
I have tried to add an If activity to the pipeline like this:
However, I then run into this error during validation:
If Condition1
The output of activity 'Copy small tables' can't be referenced since it has no output.
Does anyone have any idea how I can move this forwards?
thanks
Just orchestrate all your parallel activities towards the Pause DWH activity, i.e. make it depend on every one of them. Then it will be executed only after all of your activities have completed.
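In the pipeline JSON this simply means giving the pause activity a dependsOn entry for every Copy activity. A sketch, assuming the pause is a WebHook activity ('Copy small tables' is taken from your error message; the other activity names and the URL are placeholders):

{
  "name": "Pause DW",
  "type": "WebHook",
  "dependsOn": [
    { "activity": "Copy small tables", "dependencyConditions": [ "Succeeded" ] },
    { "activity": "Copy medium tables", "dependencyConditions": [ "Succeeded" ] },
    { "activity": "Copy large tables", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "url": "https://<your-pause-dw-endpoint>",
    "method": "POST",
    "body": "{}"
  }
}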
I think you can use the Execute Pipeline activity.
Let the trigger point to a new pipeline that has an Execute Pipeline activity pointing to the current pipeline with the copy activities, and be sure to select the option Advanced -> Wait for completion. Once the Execute Pipeline activity is done, it should move on to the webhook activity, which has the logic to pause the DW.
Let me know how this goes.
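A sketch of that wrapper pipeline, again assuming the pause step is a WebHook activity (the pipeline name, activity names and URL are placeholders):

{
  "name": "pl_nightly_orchestrator",
  "properties": {
    "activities": [
      {
        "name": "Run copy pipeline",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "pl_copy_data", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      },
      {
        "name": "Pause DW",
        "type": "WebHook",
        "dependsOn": [ { "activity": "Run copy pipeline", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": { "url": "https://<your-pause-dw-endpoint>", "method": "POST", "body": "{}" }
      }
    ]
  }
}

The nightly trigger would then be attached to pl_nightly_orchestrator instead of the existing copy pipeline.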

Running same child job with different context parameters at the same time in Talend

I have encountered a use case where I am fetching rows from tMysqlInput and then iterating over them one by one, completing a subjob for each of the rows retrieved.
tMysqlInput -----> iterate -------> (a job with multiple components, such as writing a file, logging it, making an entry in the database and other processes like this, which is a complete process in itself).
The problem is that, since the subjob after the iterate link takes care of everything itself, I just want to fork as many subjobs as there are rows fetched from tMysqlInput, each with different context parameters.
So I tried the following:
tMysqlInput ------>iterate(*n, where n is number of rows fetched)----->(job)
But what is happening here is that the threads are reading each other's context variables, and hence end up writing the same context to the same files, making the same DB entries, etc.
I want to parallelize the child job based on the number of rows fetched, with the threads properly synchronized.
The tMysqlInput query is, let's say: select file_id, input_path, output_path from some table where status='copied';
Let's say I get 4 tuples; then I want to iterate over those 4 tuples at the same time: just execute the child job and let each child job run on its own.
thanks
try this -
1) Click on the iterate link; in the Component properties tab, under Basic settings, you will see the Enable parallel execution checkbox. Once you check this checkbox, you can enter the number of iterations you want to run in parallel. This could be the number of rows returned by the tMysqlInput component (however, the total-row-count variable only gets its value AFTER the tMysqlInput component has executed: globalMap.get("tMysqlInput_X_NB_LINE")).
2) You can pass context variable values to child jobs. For this, first define the context variables in your child job; then, on the child job component after the iterate link, click on the Component properties tab and you will see the Context Param table/grid, where you click the + symbol to select a context variable and assign its value.
