The datetime in airflow Task instance keeps on updating itself - python-3.x

I created a trigger_date_time from a datetime object like this:
from datetime import datetime, timezone
trigger_date_time = str(datetime.now(tz=timezone.utc)).replace(' ', '_')
I use this variable in my Airflow DAG definition. I noticed that its value keeps updating while the task instance is being executed. I cannot understand how this happens or how to fix it. I simply want trigger_date_time to hold the value from when I triggered the DAG run, for the whole duration of the execution.

Airflow parses the DAG file every 30 seconds (by default), and that is why datetime.now() keeps assigning a new value to your trigger_date_time variable.
To get the execution date, you need to read the value from the dag_run itself.
You didn't share the full code, so it's hard to understand the context here, but if you are in the DAG definition itself you can use a Jinja template, for example {{ ts }}.
If you are inside a PythonOperator, you can use context['execution_date'].
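A minimal sketch of both options (the dag_id, task ids and DAG wiring below are placeholders, not from the question): reading the run's execution date gives a value that is fixed for the whole DAG run, unlike a datetime.now() evaluated every time the file is parsed.
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
import pendulum

def print_trigger_time(**context):
    # Same value for every task in this DAG run, no matter when the task executes.
    trigger_date_time = str(context["execution_date"]).replace(" ", "_")
    print(trigger_date_time)

with DAG(
    dag_id="trigger_time_example",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    schedule_interval=None,
) as dag:
    # Option 1: a Jinja template in any templated field, rendered once per run.
    echo_ts = BashOperator(task_id="echo_ts", bash_command="echo {{ ts }}")
    # Option 2: pull the execution date from the task context at run time.
    print_ts = PythonOperator(task_id="print_trigger_time", python_callable=print_trigger_time)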

Related

Exception: "SparkContext should only be created and accessed on the driver" while trying foreach()

Being new to Spark, I need to read data from a MySQL DB and then update (or upsert) rows in another table based on what I've read.
AFAIK, unfortunately, there's no way to do an update with DataFrameWriter, so I want to try querying the DB directly after/while iterating over partitions.
For now I'm writing a script and testing with local gluepyspark shell, Spark version 3.1.1-amzn-0.
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
def f(p):
    pass
sc.parallelize([1, 2, 3, 4, 5]).foreachPartition(lambda p: f(p))
When I try to import this simple code in the gluepyspark shell, it raises an error saying "SparkContext should only be created and accessed on the driver."
However, there are some conditions under which it works.
It works if I run the script via gluesparksubmit.
It works if I use lambda expression instead of function declaration.
It works if I declare a function within REPL and pass it as argument.
It does not work if I put both def func(): () and .foreachPartition(func) call in the same script.
Moving the function declaration to another module also seems to work, but that isn't an option because I need to pack everything into a single job script.
Could anyone please help me understand:
why the error is thrown
why the error is NOT thrown in other cases
Complete error log: https://justpaste.it/37tj6

Pass value from operator to dag

I have a BashOperator in my dag which runs a python script. At the end of its run, the python script creates a json as a report. In case of failure, I want to report this json to Slack. I have an on_failure_callback function which can push a message to Slack. However, I have no elegant way to pass the json value to the dag. Currently, I am saving the json to a file and then reading it from the file in the dag and reporting it. I also tried storing it in an environment variable and getting it in the dag. But is there a more direct way to pass this value to the dag? Preferably, without saving it to a file or an environment variable.
Rather than using on_failure_callback, you should use another task to send your Slack message.
For that, you can simply use XComs to push/pull the JSON report: https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html#
Then you can pull the XCom in the task that forwards it to Slack.
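A minimal sketch of that layout (assumed to sit inside a DAG definition; the task ids and the send_to_slack helper are hypothetical placeholders): the bash task pushes its last line of stdout as an XCom, and a downstream task pulls it and forwards it to Slack.
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

run_report = BashOperator(
    task_id="run_report",
    bash_command="python my_script.py",  # assumed to print the JSON report as its last stdout line
    do_xcom_push=True,                   # that last stdout line becomes this task's XCom
)

def notify_slack(ti, **kwargs):
    report_json = ti.xcom_pull(task_ids="run_report")
    send_to_slack(report_json)           # hypothetical helper that posts the message to Slack

notify = PythonOperator(
    task_id="notify_slack",
    python_callable=notify_slack,
    trigger_rule="all_done",             # run even if run_report fails
)

run_report >> notify
Note that if the script exits non-zero before printing the report, there is nothing to pull, so it has to emit the JSON before signalling failure.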

Execute multiple notebooks in parallel in pyspark databricks

The question is simple:
master_dim.py calls dim_1.py and dim_2.py to execute in parallel. Is this possible in Databricks PySpark?
The image below explains what I am trying to do; it errors out for some reason. Am I missing something here?
Just for others, in case they want to see how it ended up working:
from multiprocessing.pool import ThreadPool
pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(
    lambda path: dbutils.notebook.run(
        "/Test/Threading/" + path,
        timeout_seconds=60,
        arguments={"input-data": path},
    ),
    notebooks,
)
Your problem is that you're passing only Test/ as the first argument to dbutils.notebook.run (the name of the notebook to execute), but you don't have a notebook with that name.
You need to either modify the list of paths from ['Threading/dim_1', 'Threading/dim_2'] to ['dim_1', 'dim_2'] and replace dbutils.notebook.run('Test/', ...) with dbutils.notebook.run(path, ...),
or change dbutils.notebook.run('Test/', ...) to dbutils.notebook.run('/Test/' + path, ...).
Databricks now also has workflows/multi-task jobs: your master_dim task can call other jobs to execute in parallel after it finishes, passing task-value parameters to dim_1, dim_2, etc.

How to suppress warning: Datetime with no tzinfo will be considered UTC?

I have the following code, which I am using to monitor Azure ADF pipeline runs. The code uses RunFilterParameters to apply a date range filter when extracting run results:
filter_params = RunFilterParameters(
    last_updated_after=datetime.now() - timedelta(1),
    last_updated_before=datetime.now() + timedelta(1),
)
query_response = adf_client.activity_runs.query_by_pipeline_run(
    resource_group, adf_name, row.latest_runid, filter_params
)
The above works OK; however, it throws a warning:
Datetime with no tzinfo will be considered UTC
I am not sure how to add a time zone to this, or how to just suppress the warning.
Please help.
"no tzinfo" means that a naive datetime is being used, i.e. a datetime with no time zone attached. Since Python assumes local time by default for naive datetimes, this can cause unexpected behavior.
In the code example given in the question, you can create an aware datetime object (with tz set to UTC) like this:
from datetime import datetime, timezone
# and use
datetime.now(timezone.utc)
If you need a time zone other than UTC, have a look at zoneinfo (Python 3.9+ standard library).
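Applied to the filter from the question, a minimal sketch could look like this (assuming RunFilterParameters comes from azure.mgmt.datafactory.models, as in a typical ADF SDK setup):
from datetime import datetime, timedelta, timezone
from azure.mgmt.datafactory.models import RunFilterParameters

# A timezone-aware "now", so the SDK no longer has to assume UTC and the warning goes away.
now_utc = datetime.now(timezone.utc)
filter_params = RunFilterParameters(
    last_updated_after=now_utc - timedelta(days=1),
    last_updated_before=now_utc + timedelta(days=1),
)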

Parameter to notebook - widget not defined error

I have passed a parameter from job scheduling to a Databricks notebook and tried capturing it inside the Python notebook using dbutils.widgets.get().
When I run the scheduled job, I get the error "No Input Widget defined", thrown as an InputWidgetNotDefined error.
May I know the reason? Thanks.
To use widgets, you first need to create them in the notebook.
For example, to create a text widget, run:
dbutils.widgets.text('date', '2021-01-11')
After the widget has been created once on a given cluster, it can be used to supply parameters to the notebook on subsequent job runs.
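A minimal sketch putting the two pieces together (the widget name 'date' is only an example, matching the creation call above):
# Create the widget once; the second argument is the default value used when the
# notebook is run interactively without a parameter.
dbutils.widgets.text('date', '2021-01-11')
# When the notebook runs as a scheduled job, the value passed for the 'date'
# parameter overrides the default and is returned here.
run_date = dbutils.widgets.get('date')
print(run_date)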