Structure and reuse Python code with Apache Airflow 2.3.4 - python-3.x

I have to write some code for an Apache Airflow DAG and I have run into something I do not know how to do. I want to reuse some existing Python 3.x code within the Apache Airflow environment.
What I would like to achieve with this question:
I have my dags folder in /home/'user'/airflow/dags
I have another repository with code stored in /home/sources. Here I have an __init__.py and a main function which can be called with parameters, and in this repository there are several functions that are called based on those parameters.
How can I most efficiently access the main.py of the code in /home/sources from the DAG using the PythonOperator?
Thank you

If you are using a conda environment, you can simply add "/home/sources" to the PYTHONPATH.
Let's say your environment name is "airflow":
conda activate airflow
conda develop /home/sources
Now, along with airflow, your Python will also look for modules under the "/home/sources" folder. In your DAG files you can simply use:
from main import my_functions
and pass them to your PythonOperator tasks.
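For illustration, here is a minimal DAG sketch under these assumptions (my_functions comes from the import above; the dag_id and the parameters passed via op_kwargs are hypothetical and depend on your repository):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from main import my_functions  # resolved via /home/sources on the PYTHONPATH

with DAG(
    dag_id='reuse_existing_code',
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_main = PythonOperator(
        task_id='run_main',
        python_callable=my_functions,
        op_kwargs={'param': 'value'},  # hypothetical parameters for the main function
    )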
For more information, check here.

Related

How to do integration testing for dockerized Python files using pytest?

I have been given a Machine Learning framework developed in Python which is dockerized. I need to do integration testing of each Python file inside the Docker environment using pytest. How should I proceed with this?
I tried creating a Docker image for the test-case Python file and running it, but I got an error saying the response was not reachable. Also, while importing pytest, I got a docker_compose error saying it was not available.
Please suggest a solution to this problem. Thank you.

How can I use modules in Azure ML studio designer pipeline?

I am currently using a Python script in my Azure pipeline:
Import data as Dataframe --> Run Python Script --> Export Dataframe
My script was developed locally and I get import errors when trying to import tensorflow... No problem, I guess I just have to add it to the environment dependencies somewhere -- and this is where the documentation fails me. They seem to rely on the SDK without touching the GUI, but I am using the designer.
At this point I have already built some environments with the dependencies, but using these environments at the run or script level is not obvious to me.
It seems trivial, so any help on how to use modules is greatly appreciated.
To use modules that are not preinstalled (see Preinstalled Python packages), you need to add a zipped file containing the new Python packages to the Script Bundle port. See the description below from the documentation (a short import sketch follows these steps):
To include new Python packages or code, connect the zipped file that contains these custom resources to Script bundle port. Or if your script is larger than 16 KB, use the Script Bundle port to avoid errors like CommandLine exceeds the limit of 16597 characters.
Bundle the script and other custom resources to a zip file.
Upload the zip file as a File Dataset to the studio.
Drag the dataset module from the Datasets list in the left module pane in the designer authoring page.
Connect the dataset module to the Script Bundle port of Execute Python Script module.
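Once the bundle is connected, its contents can be imported from the code in the Execute Python Script module. A minimal sketch, assuming the zip contains a hypothetical module named my_module and that the unpacked bundle ends up on the module search path:

# Entry point that the Execute Python Script module calls.
def azureml_main(dataframe1=None, dataframe2=None):
    # my_module is a hypothetical module inside the zip connected to the Script Bundle port.
    import my_module

    result = my_module.transform(dataframe1)  # hypothetical function returning a DataFrame
    # The module expects a DataFrame (or a tuple of DataFrames) to be returned.
    return result,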
Please check out the document How to configure Execute Python Script.
For more information about how to prepare and upload these resources, see Unpack Zipped Data.
You can also check out this similar thread.

Cannot configure a GCP project when using DataProcPySparkOperator

I am using a Cloud Composer environment to run workflows in a GCP project. One of my workflows creates a Dataproc cluster in a different project using the DataprocClusterCreateOperator, and then attempts to submit a PySpark job to that cluster using the DataProcPySparkOperator from the airflow.contrib.operators.dataproc_operator module.
To create the cluster, I can specify a project_id parameter to create it in another project, but it seems like DataProcPySparkOperator ignores this parameter. For example, I expect to be able to pass a project_id, but I end up with a 404 error when the task runs:
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

t1 = DataProcPySparkOperator(
    project_id='my-gcp-project',
    main='...',
    arguments=[...],
)
How can I use DataProcPySparkOperator to submit a job in another project?
The DataProcPySparkOperator from the airflow.contrib.operators.dataproc_operator module doesn't accept a project_id kwarg in its constructor, so it will always default to submitting Dataproc jobs in the project the Cloud Composer environment is in. If an argument is passed, then it is ignored, which results in a 404 error when running the task, because the operator will try to poll for a job using an incorrect cluster path.
One workaround is to copy the operator and hook and modify them to accept a project ID. However, an easier solution is to use the newer operators from the airflow.providers packages if you are using a version of Airflow that supports them, because many airflow.contrib operators are deprecated in newer Airflow releases.
Below is an example. Note that there is a newer DataprocSubmitPySparkJobOperator in this module, but it is deprecated in favor of DataprocSubmitJobOperator. So, you should use the latter, which accepts a project ID.
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

t1 = DataprocSubmitJobOperator(
    project_id='my-gcp-project-id',
    location='us-central1',
    job={...},
)
If you are running an environment with Composer 1.10.5+, Airflow version 1.10.6+, and Python 3, the providers are preinstalled and can be used immediately.
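For reference, a sketch of what the job payload might look like for a PySpark job; the task ID, cluster name, and GCS path below are placeholder assumptions, not values from the question:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    'reference': {'project_id': 'my-gcp-project-id'},
    'placement': {'cluster_name': 'my-cluster'},  # placeholder cluster name
    'pyspark_job': {'main_python_file_uri': 'gs://my-bucket/jobs/job.py'},  # placeholder path
}

submit_job = DataprocSubmitJobOperator(
    task_id='submit_pyspark_job',   # placeholder task ID
    project_id='my-gcp-project-id',
    location='us-central1',         # newer provider versions call this parameter "region"
    job=PYSPARK_JOB,
)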

How to run Spark processes in a development environment using a cluster?

I'm implementing different Apache Spark solutions using IntelliJ IDEA, Scala and SBT; however, each time I want to run my implementation I need to do the following steps after creating the jar:
Amazon: send the .jar to the master node using SSH, and then run spark-shell from the command line.
Azure: I'm using the Databricks CLI, so each time I want to upload a jar, I uninstall the old library, remove the jar stored in the cluster, and finally upload and install the new .jar.
So I was wondering if it is possible to do all of this in one click, for example using the IntelliJ IDEA Run button, or with another method that makes the whole process simpler. I was also thinking about Jenkins as an alternative.
Basically, I'm looking for easier deployment options.

SageMaker Script Mode Serving

I've trained a tensorflow.keras model using SageMaker Script Mode like this:
import os
import sagemaker
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train.py',
                       source_dir='src',
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       framework_version='1.12.0',
                       py_version='py3',
                       script_mode=True)
However, how do I specify what the serving code is when I call estimator.deploy()? And what is it by default? Also is there any way to modify the nginx.conf using Script Mode?
The TensorFlow container is open source: https://github.com/aws/sagemaker-tensorflow-container. You can see exactly how it works. Of course, you can tweak it, build it locally, push it to ECR and use it on SageMaker :)
Generally, you can deploy in two ways:
Python-based endpoints: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_python.rst
TensorFlow Serving endpoints: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst
I would also recommend looking at the TensorFlow examples here: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk
With script mode the default serving method is the TensorFlow Serving-based one:
https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L393
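So, as a sketch, deploying the trained estimator with that default serving container might look like this; the instance type and the prediction payload below are only illustrative assumptions:

# deploy() serves the exported SavedModel behind a TensorFlow Serving-based endpoint.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m5.xlarge')  # assumed instance type

# The expected request format depends on your model's serving signature.
result = predictor.predict({'instances': [[0.1, 0.2, 0.3]]})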
A custom script is not allowed with the TFS-based container. You can use serving_input_receiver_fn to specify how the input data is processed, as described here: https://www.tensorflow.org/guide/saved_model
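For example, a minimal serving_input_receiver_fn sketch for a TF 1.x export; the input name, dtype, and shape are placeholder assumptions and must match your model:

import os
import tensorflow as tf

def serving_input_receiver_fn():
    # Placeholder name and shape are assumptions; align them with your model's input.
    inputs = tf.placeholder(dtype=tf.float32, shape=[None, 28, 28], name='inputs')
    return tf.estimator.export.ServingInputReceiver(
        features={'inputs': inputs},
        receiver_tensors={'inputs': inputs},
    )

# Hypothetical export call at the end of train.py, assuming the keras model was wrapped in an estimator:
# estimator.export_savedmodel(os.environ.get('SM_MODEL_DIR', '/opt/ml/model'),
#                             serving_input_receiver_fn)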
As for modifying nginx.conf, there is no supported way of doing that. Depending on what you want to change in the config file, you can hack the sagemaker-python-sdk to pass in different values for these environment variables: https://github.com/aws/sagemaker-tensorflow-serving-container/blob/3fd736aac4b0d97df5edaea48d37c49a1688ad6e/container/sagemaker/serve.py#L29
Here is where you can override the environment variables: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/serving.py#L130
