Why are my custom operators not being imported into my DAG (Airflow)?

I am creating an ETL pipeline using Apache Airflow and I am trying to create generalized custom operators. There seems to be no problem with the operators themselves, but they are not being imported into my DAG Python file.
This is my directory structure.
my_project\
    .env
    Pipfile
    Pipfile.lock
    .gitignore
    .venv\
    airflow\
        dags\
        logs\
        plugins\
            __init__.py
            helpers\
            operators\
                __init__.py
                data_quality.py
                load_fact.py
                load_dimension.py
                stage_redshift.py
This is what is present in the __init__.py file under the plugins folder.
from __future__ import division, absolute_import, print_function
from airflow.plugins_manager import AirflowPlugin

import airflow.plugins.operators as operators
import airflow.plugins.helpers as helpers

# Defining the plugin class
class SparkifyPlugin(AirflowPlugin):
    name = "sparkify_plugin"
    operators = [
        operators.StageToRedshiftOperator,
        operators.LoadFactOperator,
        operators.LoadDimensionOperator,
        operators.DataQualityOperator
    ]
    helpers = [
        helpers.SqlQueries
    ]
I'm importing these operators into my DAG file as follows:
from airflow.operators.sparkify_plugin import (StageToRedshiftOperator,
                                               LoadFactOperator,
                                               LoadDimensionOperator,
                                               DataQualityOperator)
I am getting an error as follows
ERROR - Failed to import plugin /Users/user_name/Documents/My_Mac/Projects/sparkify_etl_sql_to_sql/airflow/plugins/operators/stage_redshift.py
Can you help me understand why this is happening?

I figured out how to register my custom operators with Airflow without dedicating a separate Python file to the AirflowPlugin class. I achieved this by declaring them in the __init__.py file under the plugins directory.
This is how I did it.
My project folder structure is as follows
my_project\
    .env
    Pipfile
    Pipfile.lock
    .gitignore
    .venv\
    airflow\
        dags\
        logs\
        plugins\
            __init__.py
            helpers\
            operators\
                __init__.py
                data_quality.py
                load_fact.py
                load_dimension.py
                stage_redshift.py
My code in plugins/__init__.py:
from airflow.plugins_manager import AirflowPlugin

import operators
import helpers

# Defining the plugin class
class SparkifyPlugin(AirflowPlugin):
    name = "sparkify_plugin"
    operators = [
        operators.StageToRedshiftOperator,
        operators.LoadFactOperator,
        operators.LoadDimensionOperator,
        operators.DataQualityOperator
    ]
    helpers = [
        helpers.SqlQueries
    ]
My code in plugins/operators/__init__.py:
from operators.stage_redshift import StageToRedshiftOperator
from operators.load_fact import LoadFactOperator
from operators.load_dimension import LoadDimensionOperator
from operators.data_quality import DataQualityOperator

__all__ = [
    'StageToRedshiftOperator',
    'LoadFactOperator',
    'LoadDimensionOperator',
    'DataQualityOperator'
]
I am importing these custom operators in my DAG file (dags/etl.py) as:
from airflow.operators.sparkify_plugin import LoadDimensionOperator
sparkify_plugin is what the name attribute in the SparkifyPlugin class (stored in plugins/__init__.py) holds.
Airflow automatically registers these custom operators.
Hope it helps someone else in the future.
In case you are having import errors, try running python __init__.py for each module, as described by @absolutelydevastated. Make sure that the one in the plugins directory runs without throwing errors.
I used PyCharm, and it did throw a few errors when running the __init__.py files in the plugins/operators directory.
Fixing the one in the plugins directory and ignoring the errors thrown by plugins/operators/__init__.py fixed my issue.

If you check out: Writing and importing custom plugins in Airflow
The person there was having a similar problem with their plugin, which they fixed by including a file under airflow/plugins named for their plugin, rather than defining it in the __init__.py file.

Related

Python import from parent directory for dockerize structure

I have a project with two applications. They both use a mongo-engine database model file. They also have to start in different Docker containers, but use the same Mongo database in a third container. Now my app structure looks like this:
app_root/
    app1/
        database/
            models.py
        main.py
    app2/
        database/
            models.py
        main.py
And it works fine, BUT I have to maintain two identical database/models.py files. I don't want to do this, so I made the following structure:
app_root/
    shared/
        database/
            models.py
    app1/
        main.py
    app2/
        main.py
Unfortunately it doesn't work for me, because when I try this in my main.py:
from ..shared.database.models import *
I get
Exception has occurred: ImportError
attempted relative import with no known parent package
And when I try
from app_root.shared.database.models import *
I get
Exception has occurred: ModuleNotFoundError No module named 'app_root'
Please, what do I do wrong?
In the file where you perform the import, try adding this:
import os
import sys

# Add the directory three levels above the current working directory
# to the module search path
sys.path.append(os.path.abspath('../../..'))

from app_root.shared.database.models import *
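One caveat: os.path.abspath('../../..') is resolved relative to the current working directory, not the script's location, so the snippet above only works when the script is launched from a particular place. A sketch of a more robust variant (assuming main.py lives in app_root/app1/, as in the structure above) anchors the path to the file itself:

```python
import os
import sys

# Resolve the directory containing this file (e.g. app_root/app1/),
# then step up twice to reach the directory that contains app_root/.
HERE = os.path.dirname(os.path.abspath(__file__))
PROJECT_PARENT = os.path.dirname(os.path.dirname(HERE))
sys.path.append(PROJECT_PARENT)

# Now absolute imports work regardless of the current working directory:
# from app_root.shared.database.models import *
```

Because the path is derived from __file__, the import behaves the same whether the app is started from app_root/, from inside a Docker container, or from anywhere else.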

Python import files from 3 layers

I have the following file structure
home/user/app.py
home/user/content/resource.py
home/user/content/call1.py
home/user/content/call2.py
I have imported resource.py in app.py as below:
import content.resource
Also, I have imported call1 and call2 in resource.py:
import call1
import call2
The requirement is to run two tests individually:
run app.py
run resource.py
When I run app.py, it says it cannot find call1 and call2.
When I run resource.py, the file runs without any issues. How do I run app.py so that it can call the imported functions in resource.py, and also the call1.py and call2.py files?
All 4 files have a __main__ function.
In your __init__.py files, just create a list like this for each package. For your user package's __init__.py: __all__ = ["app", "content"]
And for your content package's __init__.py: __all__ = ["resource", "call1", "call2"]
First try: export PYTHONPATH=/home/user (make sure this is the correct absolute path).
If that doesn't solve the issue, try adding content to the path as well:
export PYTHONPATH=/home/user/:/home/user/content/
This should definitely work.
You will then import like so:
import user.app
import user.content.resource
NOTE
Whatever you want to use, you must import in every file. Don't bother importing in __init__.py; just list the modules each package includes by setting __all__ = [].
You have to import call1 and call2 in app.py if you want to call them there.
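To see what __all__ in a package's __init__.py actually controls, here is a minimal, self-contained demonstration; it builds a throwaway package in a temp directory (the names content, resource, and call1 mirror the question's layout) and shows that a star-import only pulls in the submodules listed in __all__:

```python
import os
import sys
import tempfile

# Build a throwaway package to show how __all__ interacts with
# `from package import *`.
pkg_root = tempfile.mkdtemp()
pkg_dir = os.path.join(pkg_root, "content")
os.makedirs(pkg_dir)

with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    # Only the submodules listed in __all__ are pulled in by a star-import.
    f.write("__all__ = ['resource']\n")
with open(os.path.join(pkg_dir, "resource.py"), "w") as f:
    f.write("VALUE = 'resource loaded'\n")
with open(os.path.join(pkg_dir, "call1.py"), "w") as f:
    f.write("VALUE = 'call1 loaded'\n")

sys.path.insert(0, pkg_root)

# Perform the star-import into a fresh namespace so we can inspect it.
namespace = {}
exec("from content import *", namespace)

print("resource" in namespace)  # True  -> listed in __all__
print("call1" in namespace)     # False -> not listed
```

In other words, __all__ does not import anything by itself; it only names what `from content import *` should expose, which is why each file still needs its own explicit imports.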

Python 3: Importing Files / Modules from Scattered Directories and Files

I have the following directory structure:
/home/pi
- project/
    - p1v1.py
- tools1/
    - __init__.py
    - tools1a/
        - __init__.py
        - file1.py
        - file2.py
        - tools1a1/
            - __init__.py
            - file3.py
            - file4.py
        - tools1a2/
            - __init__.py
            - file5.py
            - file6.py
I am trying to import all the modules from file1.py into my project file p1v1.py:
from file1 import *
but end up with either an
ImportError: attempted relative import with no known parent package
or a
ValueError: Attempted relative import in non-package
depending on what I use in p1v1.py, because the functions in file1.py depend on file3.py and file4.py. I would like to use explicit imports (for clarity), but I'm not sure how to do this. Any advice would be appreciated.
Thank you!
Through trial and error I eventually figured out how to solve this:
import sys

# Make /home/pi (the parent of the working directory) importable
sys.path.insert(0, '..')

from tools1.tools1a.file1 import function as f1
Note: for this to work, I needed to be editing and executing my script p1v1.py from the working directory /home/pi/project/. Hope this helps others with a similar problem!

Python 3.6 Importing a class from a parallel folder

I have a file structure as shown below:
MainFolder
    __init__.py
    FirstFolder
        __init__.py
        firstFile.py
    SecondFolder
        __init__.py
        secondFile.py
Inside firstFile.py, I have a class named Math and I want to import this class in secondFile.py.
Code for firstFile.py:
class Math(object):
    def __init__(self, first_value, second_value):
        self.first_value = first_value
        self.second_value = second_value

    def addition(self):
        self.total_add_value = self.first_value + self.second_value
        print(self.total_add_value)

    def subtraction(self):
        self.total_sub_value = self.first_value - self.second_value
        print(self.total_sub_value)
Code for secondFile.py
from FirstFolder.firstFile import Math
Math(10, 2).addition()
Math(10, 2).subtraction()
When I tried running secondFile.py, I got this error: ModuleNotFoundError: No module named 'First'
I am using Windows and the MainFolder is located in my C drive, under C:\Users\Name\Documents\Python\MainFolder
Possible solutions that I have tried: creating an empty __init__.py in every main and sub folder, adding the MainFolder directory to the path under the System Properties environment variables, and using import sys & sys.path.append('\Users\Name\Documents\Python\MainFolder').
Unfortunately, all these solutions that I have found are not working. If anyone can highlight my mistakes to me or suggest other solutions, that would be great. Any help will be greatly appreciated!
There are potentially two issues. The first is with your import statement. The import statement should be
from FirstFolder.firstFile import Math
The second is likely that your PYTHONPATH environment variable doesn't include your MainFolder.
On linux and unix based systems you can do this temporarily on the commandline with
export PYTHONPATH=$PYTHONPATH:/path/to/MainFolder
On Windows:
set PYTHONPATH=%PYTHONPATH%;C:\path\to\MainFolder
(Note: the variable to extend is PYTHONPATH, not PATH.) If you want to set it permanently, use setx instead of set.

Loading python modules in Python 3

How do I load a Python module that is not built in? I'm trying to create a plugin system for a small project I'm working on. How do I load those "plugins" into Python? And, instead of calling "import module", how can I use a string to reference the module?
Have a look at importlib
Option 1: Import an arbitrary file from an arbitrary path
Assume there's a module at /path/to/my/custom/module.py containing the following contents:
# /path/to/my/custom/module.py
test_var = 'hello'

def test_func():
    print(test_var)
We can import this module using the following code:
import importlib.machinery
myfile = '/path/to/my/custom/module.py'
sfl = importlib.machinery.SourceFileLoader('mymod', myfile)
mymod = sfl.load_module()
The module is imported and assigned to the variable mymod. We can then access the module's contents as:
mymod.test_var
# prints 'hello' to the console
mymod.test_func()
# also prints 'hello' to the console
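Note that SourceFileLoader.load_module() is deprecated in recent Python versions; the importlib documentation recommends building a module spec instead. A sketch of the equivalent approach (writing a stand-in for /path/to/my/custom/module.py to a temp file so the example is self-contained):

```python
import importlib.util
import os
import tempfile

# Create a stand-in for /path/to/my/custom/module.py so the example runs.
tmpdir = tempfile.mkdtemp()
myfile = os.path.join(tmpdir, "module.py")
with open(myfile, "w") as f:
    f.write("test_var = 'hello'\n")

# Build a spec from the file location, create a module from it,
# and execute the module's code.
spec = importlib.util.spec_from_file_location("mymod", myfile)
mymod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mymod)

print(mymod.test_var)  # hello
```

This gives the same result as the loader-based version above, without relying on the deprecated load_module() API.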
Option 2: Import a module from a package
Use importlib.import_module
For example, if you want to import settings from a settings.py file in your application root folder, you could use
_settings = importlib.import_module('settings')
The popular task queue package Celery uses this a lot; rather than giving you code examples here, please check out their git repository.
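As a minimal, self-contained illustration of importlib.import_module (using the standard-library json module in place of a project-local settings module, since the settings file itself is hypothetical):

```python
import importlib

# The module name is an ordinary string, so it can come from
# configuration, a plugin registry, user input, etc.
module_name = "json"
mod = importlib.import_module(module_name)

# The returned object is the module itself; use it as usual.
print(mod.dumps({"plugin": "loaded"}))  # {"plugin": "loaded"}
```

This is the key property for a plugin system: the set of modules to load can be decided at runtime from strings, rather than hard-coded as import statements.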
