Confusion about {{run_id}} and {{parent_run_id}} variables for Databricks jobs (Azure) - databricks

In Databricks jobs on Azure, you can use the {{run_id}} and {{parent_run_id}} variables for a specific run: https://docs.databricks.com/workflows/jobs/jobs.html
For Databricks jobs with two or more tasks, {{run_id}} seems to correspond to the task_run_id and {{parent_run_id}} seems to correspond to the job_run_id.
For Databricks jobs with only one task, {{parent_run_id}} seems to correspond to the task_run_id, but what does {{run_id}} correspond to? Is that the job_run_id?
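One way to check what each variable resolves to is to pass both of them into the task as notebook parameters and print them from the notebook; a minimal sketch, assuming a notebook task whose base parameters are named run_id and parent_run_id (the parameter names are arbitrary):

    # Base parameters configured on the task in the job definition (names are arbitrary):
    #   {"run_id": "{{run_id}}", "parent_run_id": "{{parent_run_id}}"}
    # Inside the notebook (dbutils is predefined in Databricks notebooks):
    run_id = dbutils.widgets.get("run_id")
    parent_run_id = dbutils.widgets.get("parent_run_id")
    print(f"run_id={run_id}, parent_run_id={parent_run_id}")

Comparing the printed values against the job run ID and the task run ID shown in the Jobs run UI then shows directly which ID {{run_id}} maps to for a single-task job.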

Related

How to manage multiple environments in pyspark clusters?

I want to:
Have multiple Python environments on my PySpark Dataproc cluster
Specify, when submitting a job, which environment the job should run in
I want to persist the environments so that I can use them on an as-needed basis. I won't tear down the cluster, but I will occasionally stop it. I want the environments to persist the way they do on a normal VM.
Currently I know how to submit a job with the entire environment using conda-pack, but the problem with that approach is that it ships the entire environment payload every time I submit a job, and it does not address handling multiple environments for different projects.
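One common pattern is to pre-create the Conda environments on every node (for example with an initialization action) and then select one per job by pointing spark.pyspark.python at its interpreter; a minimal sketch, assuming an environment already installed at /opt/conda/envs/proj_a on all nodes (the path and environment name are placeholders):

    from pyspark.sql import SparkSession

    # Per-project interpreter that already exists on the driver and every worker node.
    PROJECT_PYTHON = "/opt/conda/envs/proj_a/bin/python"

    spark = (
        SparkSession.builder
        .appName("proj-a-job")
        .config("spark.pyspark.python", PROJECT_PYTHON)
        .getOrCreate()
    )

    # Sanity check: confirm the executors really run under the selected environment.
    print(spark.sparkContext.parallelize([0]).map(lambda _: __import__("sys").executable).first())

In practice the same property is often passed at submit time instead (for example via --properties on gcloud dataproc jobs submit pyspark) so the driver uses the same interpreter. Either way the environments live on the cluster's disks, so they should persist across job submissions and cluster stop/start instead of being re-shipped with conda-pack on every submit.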

Solving Timeout Issue with Azure Synapse/ADF Pipeline of Pipelines

Context
We're creating a pipeline of Spark Jobs in Azure Synapse (much like in Azure Data Factory) that read in data from various databases and merge it into a larger dataset. Originally, we had a single pipeline that worked, with many Spark Jobs leading into others. As part of a redesign, we were thinking that we would create a pipeline for each individual Spark job, so that we can create various orchestration pipelines. If the definition of a Spark job changes, we only have to change the definition file in one place.
Problem
When we run our "pipeline of pipelines", we get an error that we don't get with the single pipeline. The error is a series of timeout errors like Timeout waiting for idle object, occurs in different areas of the pipeline on different runs, and results in the failure of the pipeline as a whole.
Questions
What is the issue happening here?
Is there a setting that could help solve it? (Azure Synapse does not provide many levers)
Is our pipeline-of-pipelines construction an anti-pattern? (Doesn't seem like it should be since Azure allows it)
Hypotheses
From the post here: are we running out of available connections in Azure? (how would you solve that?)
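Timeout waiting for idle object is the message Apache Commons Pool typically raises when a caller times out waiting for a pooled resource, which fits the connection-exhaustion hypothesis: many small pipelines starting at once can open far more connections against the source databases than the single sequential pipeline did. One mitigation inside the Spark jobs themselves is to cap how many concurrent JDBC connections each read opens; a hedged sketch (the URL, table, and partition column are placeholders):

    # numPartitions bounds the number of simultaneous JDBC connections this read opens.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://<server>:1433;database=<db>")  # placeholder
        .option("dbtable", "dbo.source_table")                          # placeholder
        .option("partitionColumn", "id")                                # placeholder numeric column
        .option("lowerBound", "1")
        .option("upperBound", "1000000")
        .option("numPartitions", "4")  # at most 4 concurrent connections for this read
        .load()
    )

The same limit can also be attacked from the orchestration side by lowering how many child pipelines Synapse runs at once, for example via the ForEach activity's batch count or the concurrency setting on the orchestration pipeline.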

What is the concept of concurrent pipelines (One parallel job in Azure Pipeline lets you run a single build or release job at any given time)?

I am reading about concurrent pipelines in Azure.
Concurrent pipelines
You can run concurrent pipelines (also called parallel jobs) in Azure Pipelines. One parallel job in Azure Pipelines lets you run a single build or release job at any given time. This rule is true whether you run the job on Microsoft-hosted or self-hosted agents. Parallel jobs are purchased at the organization level, and they are shared by all projects in an organization.
My understanding is that an Azure build pipeline is organized into jobs (either agent or agentless jobs), and each job contains tasks. On an automatic or manual trigger the build pipeline runs, and I thought that the number of pipelines that can run in parallel (assuming each pipeline has only one job) depends on the availability of build agents (machines, either Microsoft-hosted or self-hosted).
So what exactly is the concept of concurrent pipelines? What is the meaning of "One parallel job in Azure Pipelines lets you run a single build or release job at any given time"? In plain English, buying one parallel job should allow us to either (a) run two build pipelines simultaneously (assuming each pipeline contains only one job) or (b) run one pipeline with two jobs in parallel. But this depends on the availability of build agents, as two pipelines (with one job each) or one pipeline with two jobs will need two machines to run in parallel. Does it also mean that by default (free of charge) only one build pipeline can run at a time? There seems to be confusion between parallel job and parallel pipeline, because one pipeline can have parallel jobs.
I need clarity on this topic with respect to pipeline/job/parallel pipeline/parallel job/count of build agents/count of parallel jobs.
Check the Relationship between jobs and parallel jobs section of the docs:
1. When you define a pipeline, you can define it as a collection of jobs. When a pipeline runs, you can run multiple jobs as part of that pipeline.
2. Each job consumes a parallel job that runs on an agent. When there aren't enough parallel jobs available for your organization, the jobs are queued up and run one after the other.
So if we have a pipeline with two jobs: when the pipeline is queued, those two jobs can't run at the same time if we only have one parallel job.
Different numbers of parallel jobs are available for Microsoft-hosted and self-hosted agents; you can follow View available parallel jobs to check how many parallel jobs you have.
As for the count of build agents, there is no limit for Microsoft-hosted agents. If you mean self-hosted agents, you can have many agents in your agent pool (the limit is not something you would hit in a normal situation). You can also install more than one agent on the same local machine; see Can I install multiple self-hosted agents on the same machine?.
Hope all of the above helps :)
Well, the agents don't run pipelines.
They run jobs.
So if you are allowed to run "2 concurrent pipelines", it must mean 2 parallel jobs.
These jobs can be in a single pipeline if they are allowed to run in parallel (i.e. the other is not dependent on the first one).
Yes, on the free version it seems only one job can run in parallel.
I'm not sure when this was released, but there is a setting in the Pre-deployment conditions of an environment that fixed this for me. (Same place you'd find Triggers, Pre-deployment approvals, Gates.)
Pre-deployment conditions >> Deployment queue settings >> Number of parallel deployments >> Specific >> Maximum number of parallel deployments = 1.

Is it possible to know the resources used by a specific Spark job?

I'm exploring the idea of using a multi-tenant Spark cluster. The cluster executes jobs on demand for a specific tenant.
Is it possible to "know" the specific resources used by a specific job (for billing purposes)? E.g. if a job requires that several nodes in Kubernetes are automatically allocated, is it then possible to track which Spark jobs (and, ultimately, which tenant) initiated these resource allocations? Or are jobs always spread evenly across the allocated resources?
I tried to find information on the Apache Spark site and elsewhere on the internet without success.
See https://spark.apache.org/docs/latest/monitoring.html
You can save data from the Spark History Server as JSON and then write your own resource calculation on top of it.
It is a Spark app you mean, by the way.
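A hedged sketch of what that resource calculation could look like, using the REST API described on that monitoring page (the History Server address below is an assumption; adjust it for your deployment):

    import requests

    # Spark's monitoring REST API is served under /api/v1 (History Server default port 18080).
    HISTORY_SERVER = "http://localhost:18080"  # assumed address

    apps = requests.get(f"{HISTORY_SERVER}/api/v1/applications").json()
    for app in apps:
        executors = requests.get(
            f"{HISTORY_SERVER}/api/v1/applications/{app['id']}/allexecutors"
        ).json()
        total_cores = sum(e.get("totalCores", 0) for e in executors)
        total_storage_mem = sum(e.get("maxMemory", 0) for e in executors)
        print(app["id"], app.get("name"),
              "cores:", total_cores,
              "storage memory (bytes):", total_storage_mem)

If each tenant's jobs are submitted as separately named Spark applications, per-application totals like these (together with the executor add/remove times the same endpoint reports) can be rolled up per tenant for billing.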

Which of my Databricks notebook uses the cluster nodes?

I run several notebooks on an Azure Databricks Spark cluster at the same time.
How can I see the cluster node usage rate of each notebook/app over a period of time?
Neither the "Spark Cluster UI - Master" nor the "Spark UI" tab provides such information.
There is no automated/built-in support today for isolating the usage of particular notebooks on Databricks.
That being said, one approach would be to use the Ganglia Metrics available for Databricks clusters.
If you run both notebooks at the same time it will be difficult to discern which is responsible for a specific quantity of usage. I would recommend running one notebook to completion and taking note of the usage on the cluster. Then, run the second notebook to completion and observe its usage. You can then compare the two and have a baseline for how each one is utilizing resources on the cluster.
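Another rough option, separate from Ganglia and offered only as a sketch: label the Spark jobs each notebook submits with sparkContext.setJobDescription, so the jobs coming from each notebook can at least be told apart on the Spark UI's Jobs page (the label below is just an example):

    # Near the top of the notebook, give its Spark jobs a recognizable description
    # (spark is predefined in Databricks notebooks).
    spark.sparkContext.setJobDescription("customer-etl-notebook")

    # ...notebook logic; actions triggered afterwards carry that description...
    df = spark.range(1_000_000).selectExpr("id", "id % 10 AS bucket")
    df.groupBy("bucket").count().show()

This attributes jobs rather than raw node utilization, and the description may be reset between commands depending on how the runtime manages job properties, so it complements rather than replaces the run-one-notebook-at-a-time comparison above.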
