How to get jobId that was submitted using Dataproc Workflow Template - python-3.x

I have submitted a Hive job using a Dataproc Workflow Template with the help of the Airflow operator DataprocWorkflowTemplateInstantiateInlineOperator, written in Python. Once the job is submitted, a name is assigned to it as the jobId (example: job0-abc2def65gh12).
Since I was not able to get the jobId, I tried to pass a jobId as a parameter through the REST API, which isn't working.
Can I fetch the jobId, or, if that's not possible, can I pass the jobId as a parameter?

The jobId will be available as part of the metadata field in the Operation object returned from the instantiate operation. See this article [1] for how to work with the metadata.
The Airflow operator only polls [2] the Operation but does not return the final Operation object. You could try adding a return to its execute method.
Another option would be to use the Dataproc REST API [3] after the workflow finishes. Any labels assigned to the workflow itself are propagated to its clusters and jobs, so you can do a list-jobs call (a sketch follows the links below). For example, the filter parameter could look like: filter = labels.my-label=12345
[1] https://cloud.google.com/dataproc/docs/concepts/workflows/debugging#using_workflowmetadata
[2] https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py#L1376
[3] https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/list
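As an example of option [3], here is a minimal sketch of the list-jobs call using the REST endpoint and the requests library. The project, region, label value and access token are placeholders (an access token could come from gcloud auth print-access-token), so treat this as an illustration rather than a drop-in script.

import requests

# Placeholders: fill in your own project, region and access token.
project = "<project_id>"
region = "<region>"
token = "<access_token>"

url = f"https://dataproc.googleapis.com/v1/projects/{project}/regions/{region}/jobs"
params = {"filter": "labels.my-label = 12345"}  # label propagated from the workflow
headers = {"Authorization": f"Bearer {token}"}

# Each returned job carries its generated jobId and current state.
for job in requests.get(url, params=params, headers=headers).json().get("jobs", []):
    print(job["reference"]["jobId"], job["status"]["state"])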

Related

Can we set task wise parameters using Databricks Jobs API "run-now"

I have a job with multiple tasks, like Task1 -> Task2. I am trying to call the job using the "run now" API. The task details are below:
Task1 - executes a notebook with some input parameters
Task2 - executes a notebook with some input parameters
So, how can I provide parameters to the Jobs API's "run now" command for Task1 and Task2?
I have a parameter "lib" which needs to have the values 'pandas' and 'spark' on a per-task basis.
I know that we can give unique parameter names like Task1_lib and Task2_lib and read them that way.
Current way:
json = {"job_id": 3234234, "notebook_params": {"Task1_lib": "a", "Task2_lib": "b"}}
Is there a way to send task-wise parameters?
It's not supported right now - parameters are defined at the job level. You can ask your Databricks representative (if you have one) to communicate this ask to the product team that works on Databricks Workflows.
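Until then, the workaround is the one the question already describes: prefix the parameter names per task and have each notebook read only its own key. A rough sketch of the run-now call with that convention follows; the workspace URL, token and job_id are placeholders, and Task1_lib/Task2_lib are the names chosen in the question.

import requests

# Placeholders for the workspace and token; the job_id is the one from the question.
host = "https://<databricks_workspace_url>"
headers = {"Authorization": "Bearer <databricks_access_token>"}

payload = {
    "job_id": 3234234,
    # One prefixed key per task; each notebook reads only its own widget.
    "notebook_params": {"Task1_lib": "pandas", "Task2_lib": "spark"},
}

run = requests.post(f"{host}/api/2.1/jobs/run-now", headers=headers, json=payload).json()
print(run)  # contains the run_id of the triggered job run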

Job_run_id for Databricks jobs that only contain 1 task

It seems like the task_run_id becomes the run_id for Databricks jobs (Azure) that only have 1 task, whereas the job_run_id becomes the run_id for Databricks jobs that have 2 or more tasks. My question is: Where can I find the job_run_id for Databricks jobs (Azure) that only contain 1 task?
The task_run_id might simply be an indication that a particular job contains only a single task. Whether it is the job_run_id or the task_run_id, both refer to the run_id of a job run.
You can verify this using the runs list that can be obtained from the Jobs API. Use the following code:
import requests
import json

# Personal access token and workspace URL are placeholders.
auth = {"Authorization": "Bearer <databricks_access_token>"}

# List recent job runs in the workspace (Jobs API 2.0).
response = requests.get('https://<databricks_workspace_url>/api/2.0/jobs/runs/list', headers=auth).json()
print(json.dumps(response, indent=2))
You can navigate to your job run and check that the job's run id and the task_run_id are the same.
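If you prefer a programmatic check over the UI, each entry in the runs list carries both identifiers, so a small loop over the response above (field names as returned by the 2.0 runs/list endpoint) shows them side by side:

# Continuing from the response above: compare job_id and run_id per run.
for run in response.get("runs", []):
    print(run["job_id"], run["run_id"], run.get("run_name"))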

Airflow dags - reporting of the runtime for tracking purposes

I am trying to find a way to capture the DAG stats - i.e. run time (start time, end time), status, dag id, task id, etc. - for various DAGs and their tasks in a separate table.
I found the default logs, which go to Elasticsearch/Kibana, but there is no simple way to pull the required logs from there back into the S3 table.
Building a separate process to load those logs into S3 would duplicate data, and there would also be too much data to scan and filter, since tons of other system-related logs are generated as well.
Adding a function to each DAG would mean modifying every DAG.
What other possibilities are there to get this done efficiently, or is there any other Airflow built-in feature that can be used?
You can try using the Ad Hoc Query feature available in Apache Airflow.
This option is available under Data Profiling -> Ad Hoc Query; select airflow_db.
If you wish to get DAG statistics such as start_date, end_date, etc., you can simply query in the format below:
select start_date,end_date from dag_run where dag_id = 'your_dag_name'
The above query returns the start_date and end_date details of the DAG for all of its runs. If you wish to get the details for a particular run, you can add another filter condition like below:
select start_date,end_date from dag_run where dag_id = 'your_dag_name' and execution_date = '2021-01-01 09:12:59.0000' -- this is a sample timestamp
You can get this execution_date from the tree or graph views. You can also get other stats such as id, dag_id, execution_date, state, run_id and conf.
You can also refer to https://airflow.apache.org/docs/apache-airflow/1.10.1/profiling.html for more details.
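If you would rather pull the same data programmatically than through the Ad Hoc Query UI, a minimal sketch with SQLAlchemy against the Airflow metadata database is shown below. The connection string is a placeholder for whatever backend your airflow_db connection points at, and the columns are the same ones used in the queries above.

from sqlalchemy import create_engine, text

# Placeholder: point this at your Airflow metadata database (the airflow_db connection).
engine = create_engine("postgresql://airflow:airflow@localhost:5432/airflow")

query = text("""
    select dag_id, run_id, state, start_date, end_date
    from dag_run
    where dag_id = :dag_id
    order by execution_date desc
""")

with engine.connect() as conn:
    for row in conn.execute(query, {"dag_id": "your_dag_name"}):
        print(row.dag_id, row.run_id, row.state, row.start_date, row.end_date)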
You did not mention whether you need this information in real time or in batches.
Since you do not want to use the ES logs either, you can try Airflow metrics, if that suits your need.
However, pulling this information from the database is not efficient in any case, but it is still an option if you are not looking for real-time data collection.

Is it possible to assign job names to separate workers in a SLURM array via sbatch?

By default, when submitting a SLURM job as an array, all jobs within the array share the same job name. In the docs (here: https://slurm.schedmd.com/job_array.html), it shows that each job in the array can have its name set separately via scontrol (described under the section "Scontrol Command Use").
Can this be done directly from an sbatch script?
I just created an account because I was trying to do this and I did find a solution.
You can use scontrol to change the name of a job; the syntax is the following:
scontrol update job=<job_id> JobName=<new_name>
You can do this manually, but you can also automatically set the name of the job from within the array job, thus automatically assigning a different name to each job in the array.
I find this useful because I'm mostly running calculations in different directories, and if I have one job running much longer than the others, I want to be able to quickly see where it's running and what's going on.
Of course, you could set other things as your job name, as you prefer.
In my case, I add the scontrol command to the script I run through the array in order to obtain the name "job_name - directory" for each job. The job id and job name can be retrieved from environment variables.
scontrol update job=$SLURM_ARRAY_JOB_ID JobName="$SLURM_JOB_NAME - $folder"

Workflow System with Azure Table Storage

I have a system where we need to run a simple workflow.
Example:
On Jan 1st 08:15 trigger task A for object Z
When triggered then run some code (implementation details not important)
Schedule task B for object Z to run at Jan 3rd 10:25 (and so on)
The workflow itself is simple, but I need to run 500,000+ instances, and that's the tricky part.
I know Windows Workflow Foundation, and for that very reason I have chosen not to use it.
My initial design would be to use Azure Table Storage and I would really appreciate some feedback on the design.
The system will consist of two tables
Table "Jobs"
PartitionKey: ObjectId
Rowkey: ProcessOn (UTC Ticks in reverse so that newest are on top)
Attributes: State (Pending, Processed, Error, Skipped), etc...
Table "Timetable"
PartitionKey: YYYYMMDD
Rowkey: YYYYMMDDHHMM_<GUID>
Attributes: Job_PartitionKey, Job_RowKey
The idea is that the Jobs table will have the complete history of jobs per object and the Timetable will have a list of all jobs to run in the future.
Some assumptions:
A job will never span more than one Object
There will only ever be one pending job per Object
The "job" is very lightweight e.g. posting a message to a queue
The system must be able to perform these tasks:
Execute pending jobs
Query for all records in "Timetable" with "PartitionKey <= today" and "RowKey <= now"
For each record (in parallel)
Lookup job in Jobs table via PartitionKey and RowKey
If "not exists" or State != Pending then skip
Execute "logic". If fails => log and maybe do some retry logic
Submit "Next run date in Timetable"
Submit "Update State = Processed" and "New Job Record (next run)" as a single transaction
When all are finished => Delete all processed Timetable records
Concern: only two of the three record modifications are in a transaction. Could this be overcome in any way?
Stop workflow
Stop/pause workflow for Object Z
Query top 1 jobs in Jobs table by PartitionKey
If any AND State == Pending then update to "Cancelled"
(No need to bother cleaning Timetable it will clean itself up "when time comes")
Start workflow
Create Pending record in Jobs table
Create record in Timetable
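Purely as an illustration of the entity layout described above (not a recommendation on the design itself), here is a rough sketch of the "Start workflow" step using the azure-data-tables package. The connection string, object id and run time are placeholders, and the reverse-tick helper mimics the "UTC ticks in reverse" RowKey so that the newest job sorts first.

from datetime import datetime, timedelta, timezone
from uuid import uuid4
from azure.data.tables import TableServiceClient

MAX_TICKS = 3155378975999999999  # DateTime.MaxValue.Ticks in .NET

def reverse_ticks(dt):
    # .NET-style ticks (100 ns units since 0001-01-01), subtracted from the max
    # so that newer timestamps sort first within the partition.
    ticks = (dt.replace(tzinfo=None) - datetime(1, 1, 1)) // timedelta(microseconds=1) * 10
    return str(MAX_TICKS - ticks).zfill(19)

service = TableServiceClient.from_connection_string("<storage_connection_string>")
jobs = service.get_table_client("Jobs")
timetable = service.get_table_client("Timetable")

object_id = "Z"
process_on = datetime(2021, 1, 3, 10, 25, tzinfo=timezone.utc)

# One pending Jobs record per object...
job_entity = {"PartitionKey": object_id, "RowKey": reverse_ticks(process_on), "State": "Pending"}
jobs.create_entity(job_entity)

# ...plus a Timetable pointer keyed by date and time so future runs are easy to query.
timetable.create_entity({
    "PartitionKey": process_on.strftime("%Y%m%d"),
    "RowKey": process_on.strftime("%Y%m%d%H%M") + "_" + str(uuid4()),
    "Job_PartitionKey": job_entity["PartitionKey"],
    "Job_RowKey": job_entity["RowKey"],
})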
In terms of "executing the thing", I would be using an Azure Function or a Scheduler-type job to execute the pending jobs every 5 minutes or so.
Any comments or suggestions would be highly appreciated.
Thanks!
How about using Service Bus instead? The BrokeredMessage class has a property called ScheduledEnqueueTimeUtc. You can just schedule when you want your jobs to run via the ScheduledEnqueueTimeUtc property, and then forget about it. You can then have a triggered WebJob that monitors the Service Bus queue and will be triggered very close to the time the job message is enqueued. I'm a big fan of relying on existing services to minimize the coding needed.
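For what it's worth, the same idea in the current Python SDK (azure-servicebus v7, where the old BrokeredMessage.ScheduledEnqueueTimeUtc behaviour corresponds to scheduling a message) might look roughly like this; the connection string, queue name and payload are placeholders:

from datetime import datetime, timezone
from azure.servicebus import ServiceBusClient, ServiceBusMessage

# Placeholders: connection string, queue name and message body.
conn_str = "<servicebus_connection_string>"
queue_name = "workflow-jobs"

run_at = datetime(2021, 1, 3, 10, 25, tzinfo=timezone.utc)

with ServiceBusClient.from_connection_string(conn_str) as client:
    with client.get_queue_sender(queue_name) as sender:
        message = ServiceBusMessage('{"objectId": "Z", "task": "B"}')
        # The message only becomes visible to receivers at run_at.
        sequence_numbers = sender.schedule_messages(message, run_at)
        print(sequence_numbers)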
