I have a total of 5 notebooks.
The first is the main class notebook; the remaining four are sub/child notebooks.
Let the names of the notebooks be (all are written in Scala):
mainclass,
child1,
child2,
child3,
child4
I want to call the child notebooks from the main class notebook based on IF conditions and execute them concurrently/in parallel.
For example, in the main class:
var child1="Y"
var child2="Y"
var child3="N"
var child4="N"
I want to call the notebooks whose flag is "Y" and run them concurrently.
if (child1 == "Y") // run the child1 notebook
and the same for all the other notebooks.
Kindly suggest a way to do this.
Thanks!
Calling a notebook from within a notebook will not result in a concurrent run as you desire.
Since you are on Azure, you should look at Azure Data Factory.
You can build a pipeline that uses the parameters you supply to control the flow of execution for each of the notebooks, along with the scheduling and other utilities provided within ADF.
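If it helps, here is a minimal sketch of how the Y/N flags could be supplied as pipeline parameters when triggering an ADF pipeline from Python. It assumes the azure-identity and azure-mgmt-datafactory packages; the subscription, resource group, factory and pipeline names are placeholders, and the pipeline itself would use If Condition activities on these parameters to decide which notebook activities to execute.

```python
# Minimal sketch: trigger an ADF pipeline, passing the Y/N flags as parameters.
# All names below (subscription, resource group, factory, pipeline) are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# Each flag becomes a pipeline parameter; If Condition activities inside the
# pipeline check the flag and run the matching Databricks notebook activity.
run = adf_client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    pipeline_name="run_children",  # hypothetical pipeline name
    parameters={"child1": "Y", "child2": "Y", "child3": "N", "child4": "N"},
)
print(run.run_id)
```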
I have a job with multiple tasks, like Task1 -> Task2. I am trying to trigger the job using the "run now" API. The task details are below:
Task1 - executes a notebook with some input parameters
Task2 - executes a notebook with some input parameters
So how can I provide task-wise parameters to the jobs API when using "run now" for Task1 and Task2?
I have a parameter "lib" which needs to have the values 'pandas' and 'spark' task-wise.
I know that we can give unique parameter names like Task1_lib and Task2_lib and read them that way.
Current way:
json = {"job_id": 3234234, "notebook_params": {"Task1_lib": "a", "Task2_lib": "b"}}
Is there a way to send task wise parameters?
It's not supported right now - parameters are defined at the job level. You can ask your Databricks representative (if you have one) to communicate this request to the product team that works on Databricks Workflows.
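For completeness, here is a minimal sketch of the job-level workaround already mentioned in the question (prefixed parameter names), using the Jobs API run-now endpoint; the workspace URL and token are placeholders, and each notebook would read its own key via dbutils.widgets.get.

```python
# Sketch: trigger the job via the Jobs API "run now" call with job-level
# notebook_params; each task reads its own prefixed key (Task1_lib / Task2_lib).
import requests

DATABRICKS_HOST = "https://<workspace-url>"   # placeholder
DATABRICKS_TOKEN = "<personal-access-token>"  # placeholder

payload = {
    "job_id": 3234234,
    "notebook_params": {"Task1_lib": "pandas", "Task2_lib": "spark"},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["run_id"])
```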
I see that my Spark application is using the FAIR scheduler:
But I can't confirm whether it is using the two pools I set up (pool1, pool2). Here is a thread function I implemented in PySpark, which is called twice: once with "pool1" and once with "pool2".
def do_job(f1, f2, id, pool_name, format="json"):
    # Assign the jobs submitted by this thread to the given FAIR scheduler pool
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    ...
I thought the "Stages" tab is supposed to show the pool info, but I don't see it there. Does that mean the pools are not set up correctly, or am I looking in the wrong place?
I am using PySpark 3.3.0 on top of EMR 6.9.0.
You can confirm it as in the diagram below.
Please refer to my article: I created 3 pools (module1, module2, module3) based on certain logic, and each job uses a specific pool, as shown above. Based on this I created the diagrams below.
Note: please see the verification steps in the article I referenced.
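To make the verification concrete, here is a minimal, self-contained sketch, assuming the session was created with spark.scheduler.mode=FAIR (and optionally a fairscheduler.xml that defines pool1 and pool2): two threads each set the pool property and submit a small job, and while both jobs are running the Stages tab of the Spark UI should list the pools and show a pool name per stage.

```python
# Sketch: submit two concurrent jobs to different FAIR scheduler pools.
import threading

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fair-pools-check").getOrCreate()

def do_job(pool_name):
    # The pool assignment is a thread-local property, so set it inside the
    # thread that submits the job.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    print(pool_name, spark.sparkContext.getLocalProperty("spark.scheduler.pool"))
    spark.range(100_000_000).selectExpr("sum(id)").collect()

threads = [threading.Thread(target=do_job, args=(p,)) for p in ("pool1", "pool2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If the pool information still does not appear while these jobs are active, the session was most likely not started with spark.scheduler.mode=FAIR.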
I'm creating an ADF pipeline and I'm using a ForEach activity to run multiple Databricks notebooks.
My problem is that two notebooks have a dependency on each other.
That is, one notebook has to run before the other because of that dependency. I know that the ForEach activity can be executed sequentially or by batch, but when running sequentially it runs one item at a time, and since I have partitions, it takes a long time.
What I want is to run sequentially but by batch. In other words, I have a notebook that will run with the ES, UK and DK partitions, and I want it to run these partitions in parallel, wait for the notebook to finish completely, and only then start running the other notebook over the same partitions. If I set it to batch, it doesn't wait for full execution; it starts running the other notebook at random.
I get the order of the notebooks from a config table, in which I specify the order they should run in, and then I have a notebook that builds my final JSON with that order.
Config table:

| sPath | TableSource | TableDest  | order |
| ----- | ----------- | ---------- | ----- |
| path1 | dbo.table1  | dbo.table1 | 1     |
| path2 | dbo.table2  | dbo.table2 | 2     |
This is my pipeline:
I wanted the execution to be by batch and sequential, but it is not possible to select Sequential and a Batch count at the same time.
Can anyone please help me in achieving this?
Thank you!
I have tried to reproduce this. To run a ForEach both sequentially and batch-wise, we need two pipelines, one nested inside the other: the outer pipeline is used for the sequential run and the inner pipeline for the batch run. Below are the steps.
I took a sample config table as in the image below.
Pipeline1 is the outer pipeline. In it, a Lookup activity is used to select only the sortorder field values in increasing order. The sortorder value is passed as a parameter to the child pipeline sequentially.
select distinct sortorder from config_table order by sortorder
A ForEach activity is added after the Lookup activity; we use this for the sequential run. Sequential is therefore checked and, in the Items text box, the output of the Lookup activity is given:
@activity('Lookup1').output.value
Inside the ForEach activity, pipeline2 is invoked with an Execute Pipeline activity. A pipeline parameter pp_sortorder is added in the child pipeline pipeline2.
In pipeline2, a Lookup activity is added with a dataset referring to the config table, filtered by the sortorder value received from pipeline1:
select * from config_table where sortorder = @{pipeline().parameters.pp_sortorder}
Next to the Lookup, a ForEach is added; in Items, the Lookup activity output is given, and a Batch count of 5 is set here (the batch count can be increased as required).
A Stored Procedure activity is added inside this ForEach activity to check the parallel processing.
After setting all this up, **pipeline1** is executed. The Execute Pipeline activity of pipeline1 runs sequentially, and the Stored Procedure activities of pipeline2 run simultaneously.
pipeline1 output status: the second Execute Pipeline activity starts only once the first one has ended.
pipeline2 output status: all Stored Procedure activities start executing simultaneously.
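As a purely conceptual illustration of the control flow this nested setup achieves (not something ADF executes), the outer loop is sequential over sortorder and the inner loop is parallel over the partitions; the config rows and run_notebook function below are hypothetical placeholders.

```python
# Conceptual sketch: sequential outer loop over sortorder,
# parallel inner "batch" over the partitions of each notebook.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the config table / Lookup output.
config_rows = [
    {"sortorder": 1, "notebook": "/notebooks/table1", "partitions": ["ES", "UK", "DK"]},
    {"sortorder": 2, "notebook": "/notebooks/table2", "partitions": ["ES", "UK", "DK"]},
]

def run_notebook(notebook, partition):
    # Placeholder for whatever actually runs the notebook for one partition.
    print(f"running {notebook} for {partition}")

for row in sorted(config_rows, key=lambda r: r["sortorder"]):
    partitions = row["partitions"]
    # Run all partitions of this notebook in parallel...
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        pool.map(run_notebook, [row["notebook"]] * len(partitions), partitions)
    # ...and only move to the next sortorder once every partition has finished.
```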
In the RunDetails Jupyter module, what does the table (see screenshot below) represent?
The RunDetails(run_instance).show() method from the azureml-widgets package shows the progress of your job along with streaming the log files. The widget is asynchronous and provides updates until the training run finishes.
Since the output shown is specific to your Pipeline run, you can troubleshoot it further from the logs from pipeline runs, which can be found in either the Pipelines or Experiments section of the studio.
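For context, here is a minimal usage sketch of the widget, assuming the azureml-core and azureml-widgets packages; the experiment name and training script are placeholders.

```python
# Sketch: submit a run and render the RunDetails widget in a notebook cell.
from azureml.core import Workspace, Experiment, ScriptRunConfig
from azureml.widgets import RunDetails

ws = Workspace.from_config()                       # reads your config.json
exp = Experiment(workspace=ws, name="demo-exp")    # placeholder experiment name
config = ScriptRunConfig(source_directory=".", script="train.py")  # placeholder script

run = exp.submit(config)
RunDetails(run).show()                   # the table/plots update until the run finishes
run.wait_for_completion(show_output=True)
```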
I have a DAG with one task that fetches data from an API. I want that task to fetch the data only for a certain time interval and then mark itself as SUCCESS so that the tasks after it start running.
Please note that the downstream tasks depend on the task which I want to mark SUCCESS. I know I can mark the task SUCCESS manually from the CLI or UI, but I want to do it automatically.
Is it possible to do that programmatically using Python in Airflow?
You can set the status of a task using Python code, like this:

from airflow.models import TaskInstance
from airflow.utils.state import State

def set_task_status(**kwargs):
    execution_date = kwargs['execution_date']
    # HiveOperatorTest is the task (operator instance) whose state is being set
    ti = TaskInstance(HiveOperatorTest, execution_date)
    ti.set_state(State.SUCCESS)
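If useful, here is a minimal sketch of wiring that callable into a DAG as its own task, assuming Airflow 2.x and that set_task_status is defined in the same DAG file; the DAG id is a placeholder.

```python
# Sketch: run the status-setting callable as a task of its own (Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="mark_task_success_demo",   # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    mark_success = PythonOperator(
        task_id="set_task_status",
        python_callable=set_task_status,  # the function defined above
    )
```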