Job_run_id for Databricks jobs that only contain 1 task

It seems like the task_run_id becomes the run_id for Databricks jobs (Azure) that only have 1 task, whereas the job_run_id becomes the run_id for Databricks jobs that have 2 or more tasks. My question is: Where can I find the job_run_id for Databricks jobs (Azure) that only contain 1 task?

The task_run_id may simply indicate that a particular job contains only a single task. Whether it appears as the job_run_id or the task_run_id, both refer to the run_id of a job run.
You can verify this by retrieving the job runs list from the Jobs API. Use the following code:
import requests

# Call the Jobs API runs list endpoint with a personal access token
auth = {"Authorization": "Bearer <databricks_access_token>"}
response = requests.get(
    'https://<databricks_workspace_url>/api/2.0/jobs/runs/list',
    headers=auth
).json()
print(response)
You can navigate to your job run and check that the job's run id and the task_run_id are the same.
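As a rough follow-up sketch (assuming the response contains a "runs" list whose entries expose job_id and run_id, as in Jobs API 2.0), you can print the identifiers per run instead of the whole response:
# Print the job-level and run-level identifiers for each returned run
for run in response.get('runs', []):
    print(run.get('job_id'), run.get('run_id'))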

Related

Can we set task wise parameters using Databricks Jobs API "run-now"

I have a job with multiple tasks like Task1 -> Task2. I am trying to call the job using the API's "run now" endpoint. Task details are below:
Task1 - executes a notebook with some input parameters
Task2 - executes a notebook with some input parameters
So how can I provide parameters for Task1 and Task2 to the Jobs API when using "run now"?
I have a parameter "lib" which needs to have the value 'pandas' for one task and 'spark' for the other.
I know that we can give unique parameter names like Task1_lib, Task2_lib and read them that way.
current way:
json = {"job_id": 3234234, "notebook_params": {"Task1_lib": "a", "Task2_lib": "b"}}
Is there a way to send task-wise parameters?
It's not supported right now; parameters are defined at the job level. You can ask your Databricks representative (if you have one) to pass this request on to the product team that works on Databricks Workflows.
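A minimal sketch of the workaround with unique per-task parameter names, assuming the placeholder job ID from the question, a placeholder token and workspace URL, and that each notebook reads its own value with dbutils.widgets.get:
import requests

headers = {"Authorization": "Bearer <databricks_access_token>"}
payload = {
    "job_id": 3234234,  # placeholder job ID from the question
    "notebook_params": {"Task1_lib": "pandas", "Task2_lib": "spark"},
}
# Trigger the job; Task1's notebook would read dbutils.widgets.get("Task1_lib"),
# Task2's notebook dbutils.widgets.get("Task2_lib")
run = requests.post(
    'https://<databricks_workspace_url>/api/2.0/jobs/run-now',
    headers=headers,
    json=payload,
).json()
print(run.get('run_id'))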

Airflow dags - reporting of the runtime for tracking purposes

I am trying to find a way to capture DAG stats, i.e. run time (start time, end time), status, dag id, task id, etc., for various DAGs and their tasks, in a separate table.
I found the default logs, which go to Elasticsearch/Kibana, but there is no simple way to pull the required logs from there back into the S3 table.
Building a separate process to load those logs into S3 would duplicate data, and there would be too much data to scan and filter, since plenty of other system-related logs are generated as well.
Adding a function to each DAG would mean modifying every DAG.
What other possibilities are there to do this efficiently, or is there another built-in Airflow feature that can be used?
You can try using the Ad Hoc Query feature available in Apache Airflow.
This option is available at Data Profiling -> Ad Hoc Query; select airflow_db.
If you wish to get DAG statistics such as start_date and end_date, you can simply query in the format below:
select start_date, end_date from dag_run where dag_id = 'your_dag_name'
The above query returns start_date and end_date details of the DAG for all its DAG runs. If you wish to get details for a particular run, you can add another filter condition like below:
select start_date, end_date from dag_run where dag_id = 'your_dag_name' and execution_date = '2021-01-01 09:12:59.0000' -- this is a sample time
You can get this execution_date from the tree or graph views. You can also get other stats such as id, dag_id, execution_date, state, run_id and conf.
You can also refer to the Data Profiling documentation for more details: https://airflow.apache.org/docs/apache-airflow/1.10.1/profiling.html
You did not mention whether you need this information in real time or in batches.
Since you do not want to use the ES logs either, you can try Airflow metrics, if they suit your need.
However, pulling this information from the database is not efficient in any case, but it is still an option if you are not looking for real-time data collection.
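As a rough sketch of the database option (assuming Airflow 1.10.x, direct access to the metadata database, and 'your_dag_name' as a placeholder), you could also query the DagRun model through Airflow's own session instead of raw SQL:
from airflow.models import DagRun
from airflow.settings import Session

# Query the metadata database for run-level stats of one DAG
session = Session()
runs = session.query(DagRun).filter(DagRun.dag_id == 'your_dag_name').all()
for run in runs:
    print(run.run_id, run.state, run.start_date, run.end_date)
session.close()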

Is it possible to force mark success any task in a DAG after certain time interval programmatically in Airflow?

I have a DAG with one task that fetches data from an API. I want that task to fetch the data only for a certain time interval and then mark itself as SUCCESS so that the tasks after it start running.
Please note that the downstream tasks depend on the task I want to mark as SUCCESS. I know I can mark the task SUCCESS manually from the CLI or UI, but I want to do it automatically.
Is it possible to do that programmatically using python in Airflow?
You can set the status of a task using Python code, like this:
from airflow.models import TaskInstance
from airflow.utils.state import State

def set_task_status(**kwargs):
    execution_date = kwargs['execution_date']
    # HiveOperatorTest is the task (operator instance) whose state should be changed
    ti = TaskInstance(HiveOperatorTest, execution_date)
    ti.set_state(State.SUCCESS)
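A hypothetical way to wire this into a DAG, assuming an Airflow 1.x-style PythonOperator, an existing dag object, and that HiveOperatorTest above stands for the operator instance you want to mark as successful:
from airflow.operators.python_operator import PythonOperator

# Hypothetical companion task that calls the helper above
mark_success = PythonOperator(
    task_id='mark_task_success',
    python_callable=set_task_status,
    provide_context=True,  # passes execution_date into **kwargs (Airflow 1.x)
    dag=dag,
)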

Capture failed tasks from Celery workers during task execution

I am currently using Task.delay() in celery (with RabbitMQ) to perform asynchronous tasks. The tasks are distributed to 8 workers, and in these tasks, I am performing a database insertion operation.
Sometimes a task will fail and the database insertion does not happen. This can be due to a timeout or JSON decode errors. I want to capture when an execution has failed.
Here is the relevant code:
views.py
from django.http import JsonResponse
from .tasks import result  # the Celery task defined in tasks.py

def celery_url_filtering(request):
    for each_data in dataset:
        # each_data is a JSON object
        res = result.delay(each_data)
        while res.status == 'PENDING':
            pass
    return JsonResponse({'ok': 'Success'})
tasks.py
@app.task
def result(each_data):
    # Parse each_data and do the data insertion here
    return "Something"
How can I capture the failed executions in a list?
From what I understand, you want to look at the "PENDING" or "FAILURE" tasks and retry them or apply some application logic.
If that is so, you can have a cron job running on a fixed schedule, daily/hourly etc., based on your requirement. This cron job can capture the tasks that failed in the last day/hour, depending on the schedule you have.
You can use django-celery-beat for setting the cron job and django-celery-results to store celery task results using Django ORM.
For example, you can have a celery task like this
tasks.py
from celery import shared_task
from django_celery_results.models import TaskResult
@shared_task(name="failed_task_cron", bind=True)
def failed_task_cron(self, **kwargs):
    """
    Celery task to run on a daily schedule to do something with the failed tasks
    """
    tasks = TaskResult.objects.filter(status='FAILURE')
    # tasks is a queryset of all the failed tasks
    # Perform application logic
You can set the cron schedule for the above task like this in your Celery settings:
from celery.schedules import crontab
# ...
CELERY_BEAT_SCHEDULE = {
    "failed_task_cron": {
        "task": "failed_task_cron",  # the registered task name from the decorator above
        "schedule": crontab(minute=0),  # runs at the top of every hour
    }
}
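To answer the original question directly, a minimal sketch (assuming django-celery-results is configured as the result backend, e.g. CELERY_RESULT_BACKEND = 'django-db') that collects the failed executions into a list:
from django_celery_results.models import TaskResult

# Collect failed executions (task id plus the stored result/exception) into a list
failed_executions = list(
    TaskResult.objects.filter(status='FAILURE').values_list('task_id', 'result')
)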

How to get jobId that was submitted using Dataproc Workflow Template

I have submitted a Hive job using a Dataproc Workflow Template with the help of an Airflow operator (DataprocWorkflowTemplateInstantiateInlineOperator) written in Python. Once the job is submitted, a name is assigned as the jobId (example: job0-abc2def65gh12).
Since I was not able to get the jobId, I tried to pass jobId as a parameter through the REST API, which isn't working.
Can I fetch the jobId or, if that is not possible, can I pass the jobId as a parameter?
The jobId will be available as part of the metadata field in the Operation object returned from the instantiate operation. See [1] for how to work with WorkflowMetadata.
The Airflow operator only polls [2] the Operation but does not return the final Operation object. You could try adding a return to its execute method.
Another option would be to use the Dataproc REST API [3] after the workflow finishes. Any labels assigned to the workflow itself are propagated to its clusters and jobs, so you can do a jobs list call (see the sketch after the links below). For example, the filter parameter could look like: filter = labels.my-label=12345
[1] https://cloud.google.com/dataproc/docs/concepts/workflows/debugging#using_workflowmetadata
[2] https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py#L1376
[3] https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/list
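A rough sketch of that jobs list call over REST, assuming hypothetical project, region and label values, and an OAuth access token obtained elsewhere (e.g. via gcloud auth print-access-token):
import requests

project_id = 'my-project'      # hypothetical placeholders
region = 'us-central1'
token = '<oauth_access_token>'

resp = requests.get(
    f'https://dataproc.googleapis.com/v1/projects/{project_id}/regions/{region}/jobs',
    headers={'Authorization': f'Bearer {token}'},
    params={'filter': 'labels.my-label=12345'},
).json()

# Each returned job carries its generated job id under reference.jobId
for job in resp.get('jobs', []):
    print(job['reference']['jobId'])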
