Azure PySpark: "from modules import myfunctions" gives No module named 'sharedFunctions'

I have tried a number of methods, based on some Google searches, to import a local script containing a bunch of shared functions from our shared team directory; an example is in the code below. I also tried "from . import sharedFunctions" with the importing script in the same directory, and "from sharedModules import sharedFunctions" from the parent directory. All of these return No module named 'sharedFunctions'. What is the best way to set this up in Azure?
Thanks
import sys, os

dir_path = '/Shared/XXX/sharedModules'
sys.path.insert(0, dir_path)
print(sys.path)
# dir_path = os.path.dirname(os.path.realpath(__file__))
# sys.path.insert(0, dir_path)
import sharedFunctions  # fails here: No module named 'sharedFunctions'

sourceTable = 'dnb_raw'
sourceQuery = 'select DUNSNumber , GlobalUltimate_Name, BusinessName from'
sourceId = 'DUNSNumber'
sourceNameList = ['Tradestyle', 'BusinessName']
NewTable = 'default.' + sourceTable + '_enhanced'
# dbutils.fs.rm("dbfs:/" + NewTable + "/", recurse=True)
sharedFunctions.clean_names(sourceTable, sourceQuery, sourceId, sourceNameList)

When you're working with notebooks in Databricks, they are not stored on a file system that Python understands, so they cannot be imported as modules.
If you want to include another notebook with some other definitions into the current context, you can use the %run magic command, passing the path of the other notebook as an argument:
%run /Shared/XXX/sharedModules/sharedFunctions
But %run is not a full substitute for imports, as described in the documentation:
You cannot use %run to run a Python file and import the entities defined in that file into a notebook. To import from a Python file you must package the file into a Python library, create a Databricks library from that Python library, and install the library into the cluster you use to run your notebook.
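For completeness, a minimal sketch of that packaging route (the package name, version, and paths below are hypothetical):
# On a development machine, lay the shared code out as a package:
#
#   sharedfunctions/__init__.py   <- contains clean_names() etc.
#   setup.py                      <- from setuptools import setup, find_packages
#                                    setup(name='sharedfunctions', version='0.1',
#                                          packages=find_packages())
#
# Build a wheel and copy it to DBFS:
#   python setup.py bdist_wheel
#   databricks fs cp dist/sharedfunctions-0.1-py3-none-any.whl dbfs:/Shared/XXX/
#
# Then, in its own cell at the top of the notebook (a %pip command
# resets the notebook's Python state):
%pip install /dbfs/Shared/XXX/sharedfunctions-0.1-py3-none-any.whl

# And in a later cell:
from sharedfunctions import clean_names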
If you want to execute another notebook and get results from it, you can use the so-called notebook workflow: when executing via dbutils.notebook.run, the notebook is scheduled for execution, you can pass parameters to it, etc., but results are shared mostly via the file system, managed tables, etc.
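For example, a minimal sketch of such a workflow (the notebook path and parameter names here are hypothetical):
# Run another notebook, waiting up to 60 seconds; the callee can return
# a string via dbutils.notebook.exit(...)
result = dbutils.notebook.run(
    '/Shared/XXX/sharedModules/prepareData',  # hypothetical notebook path
    60,                                       # timeout in seconds
    {'sourceTable': 'dnb_raw'}                # read in the callee via dbutils.widgets.get
)
print(result)  # whatever the callee passed to dbutils.notebook.exit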

Related

How to import library from a given path in Azure Databricks?

I want to import a standard library (of a given version) in a Databricks notebook job. I do not want to install the library every time a job cluster is created for this job. I want to install the library in a DBFS location and import it directly from that location (by changing sys.path or something similar).
This is working in local:
I installed a library in a given path using:
pip install --target=customLocation library==major.minor
Append the custom location to the sys.path variable:
sys.path.insert(0, 'customLocation')
When I import the library and check its location, I get the expected response:
import library
print(library.__file__)
#Output - customLocation\library\__init__.py
However, the same exact sequence does not work in Databricks:
I installed the library in DBFS location:
%pip install --target='/dbfs/customFolder/' numpy==1.19.4
Append to the sys.path variable:
sys.path.insert(0, '/dbfs/customFolder/')
Tried to find numpy version and file location:
import numpy
print(numpy.__version__)
print(numpy.__file__)
#Output - 1.21.4 (Databricks Runtime 10.4 Default Numpy)
#         /dbfs/customFolder/numpy/__init__.py
The customFolder contains version 1.19.4, and the imported numpy reports that location, but the version number does not match. Why?
How exactly do imports work in Databricks to produce this behaviour?
I also tried importing with importlib ("Importing a source file directly": https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly) and the result remains the same.
import importlib.util
import sys
spec = importlib.util.spec_from_file_location('numpy', '/dbfs/customFolder/numpy/__init__.py')  # must point at a file, not a directory
module = importlib.util.module_from_spec(spec)
sys.modules['numpy'] = module
spec.loader.exec_module(module)
import numpy
print(numpy.__version__)
print(numpy.__file__)
Related:
How to install Python Libraries in a DBFS Location and access them through Job Clusters (without installing it for every job)?
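One thing worth ruling out is Python's module cache: if the runtime already imported numpy before sys.path was changed, the cached copy wins no matter what the path order says. A minimal sketch to force a fresh import (assuming the 1.19.4 copy really is under /dbfs/customFolder/):
import sys

sys.path.insert(0, '/dbfs/customFolder/')

# Drop any numpy modules the runtime has already cached
for name in [m for m in sys.modules if m == 'numpy' or m.startswith('numpy.')]:
    del sys.modules[name]

import numpy
print(numpy.__version__, numpy.__file__)  # should now report the custom copy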

How does one import the contents of a text or configuration file innately in a project?

I tried the following in an __init__.py file, thinking that it would be evaluated according to its location at the time of import:
# /../../proj_dir_foo/__init__.py
# opens file: .../proj_dir_foo/foo.csv
import pandas as pd
settings = pd.read_csv('foo.csv')
And from a different file:
from foo.bar.proj_dir_foo import settings
Yields: FileNotFoundError
But this is not really convenient. Instead of accumulating configuration files, which are much easier to modify, I am accumulating source code in proj_dir_foo that stores configuration info.
The sole reason it is in source code is that a project module which knows where the root's resources or materials folder full of configs sits is not technically a "module" any more. Instead, it is an integrated cog in a machine, or rather a thing I can no longer easily refactor.
How does one modularize an arbitrary configuration file in a Python project?
Your script's current directory is the directory from which you started it. import os; print(os.getcwd()) will show you that.
If you want to open a file that sits in a place relative to your code, you have several options.
Use sys.argv[0] to get the path to your script; use os.path.dirname() to extract the directory from it, and os.path.join() to make a path to a particular file:
# some script
import json, os, sys

my_path = os.path.dirname(sys.argv[0])
cfg_path = os.path.join(my_path, 'config', 'settings.json')
with open(cfg_path) as cfg_file:
    my_settings = json.load(cfg_file)
Alternatively, if you import a module, you can use its __file__ attribute to learn where it was imported from, and use that to locate a config:
# some script
import os.path
import foo.bar.baz

baz_cfg_path = os.path.join(os.path.dirname(foo.bar.baz.__file__), 'baz.cfg')

How to use/install python code/file in Juypter notebooks

I got a code file, data_load_utils.py, from GitHub. I'm following a tutorial where this import is used, on Python 3.x and Jupyter Notebooks with a connection to SAP HANA 2.0 Express Edition.
File location - https://github.com/SAP-samples/hana-ml-samples/blob/master/Python-API/pal/notebooks/data_load_utils.py
Commands I'm using for the tutorial:
from hana_ml import dataframe
from data_load_utils import DataSets, Settings
Error I'm getting:
ModuleNotFoundError: No module named 'data_load_utils'
I found this utility, data_load_utils.py, as a code file, but I'm not sure how to use it or attach it to Python or Jupyter notebooks so that the code works and this error goes away.
Help will be appreciated.
You need to tell Jupyter where to look for modules via sys.path.
From this doc, you can add your module’s sub-directory to Python's path like this:
import os
import sys
sys.path.insert(0, os.path.abspath('../module-subdirectory'))
Then you can simply import it:
from data_load_utils import DataSets, Settings
Note: here module-subdirectory is the sub-directory that contains data_load_utils.py.
For alternate methods, please refer to this doc.
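If editing sys.path feels heavy, another option is to download the file into the notebook's working directory first; the raw URL below assumes GitHub's usual raw.githubusercontent.com pattern:
import urllib.request

url = ('https://raw.githubusercontent.com/SAP-samples/hana-ml-samples/'
       'master/Python-API/pal/notebooks/data_load_utils.py')
urllib.request.urlretrieve(url, 'data_load_utils.py')

# The notebook's working directory is on sys.path, so this now resolves
from data_load_utils import DataSets, Settings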

Setting pythonpath on Jupyter notebook

I want to add a permanent PYTHONPATH using Jupyter Notebook to be able to access data from a particular directory or folder. I have read that we could use JUPYTER_PATH for this.
Could someone give me step-by-step instructions on how to do it? I am new to this and the documentation was not very clear.
For example's sake, let's say my path is:
C:\ENG\Fin_trade\ION
For a script that needs to reference your directory, you can do the following.
Say you have a file foo.py containing the class Foo in your directory C:\ENG\Fin_trade\ION:
import sys

new_path = r'C:\ENG\Fin_trade\ION'
sys.path.append(new_path)

import foo
foo.Foo()
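If you want the path to survive restarts without touching every notebook, one option is a .pth file in site-packages; the interpreter adds every line of such a file to sys.path at startup. A sketch (the file name ion.pth is arbitrary):
import os
import site

# Directories scanned for .pth files at interpreter startup
site_dir = site.getsitepackages()[0]

# Write a one-line .pth file listing the directory to add
# (may require write permission to site-packages)
with open(os.path.join(site_dir, 'ion.pth'), 'w') as f:
    f.write('C:\\ENG\\Fin_trade\\ION\n')
After restarting the kernel, C:\ENG\Fin_trade\ION appears on sys.path automatically.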

__file__ does not exist in Jupyter Notebook

I'm on a Jupyter Notebook server (v4.2.2) with Python 3.4.2, and
I want to use the global name __file__ because the notebook will be cloned by other users, and in one section I have to run:
def __init__(self, trainingSamplesFolder='samples', maskFolder='masks'):
    self.trainingSamplesFolder = self.__getAbsPath(trainingSamplesFolder)
    self.maskFolder = self.__getAbsPath(maskFolder)

def __getAbsPath(self, path):
    if os.path.isabs(path):
        return path
    else:
        return os.path.join(os.path.dirname(__file__), path)
__getAbsPath(self, path) checks whether the path param is relative or absolute and returns the absolute version of path, so I can use the returned path safely later.
But I get the error
NameError: name '__file__' is not defined
I searched for this error online and found the "solution" that I should use sys.argv[0] instead, but print(sys.argv[0]) returns
/usr/local/lib/python3.4/dist-packages/ipykernel/__main__.py
But the correct notebook location should be /home/ubuntu/notebooks/.
Thanks to the reference "How do I get the current IPython Notebook name" from Martijn Pieters (in the comments): the last answer (not accepted) fits my needs perfectly:
print(os.getcwd())
/home/ubuntu/notebooks
If you want to get the path of the directory in which your script is running, I would highly recommend using:
os.path.abspath('')
Advantages
It works from Jupyter Notebook
It works from the REPL
It doesn't require Python 3.4's pathlib
Please note that one scenario where __file__ has an advantage is when you invoke Python from directory A but run a script in directory B; in that case this method, like most others, will return A, not B. For a Jupyter notebook, however, you always get the folder of the .ipynb file instead of the directory from which you launched jupyter notebook.
__file__ might not be available to you, but you can actually get the folder your notebook is located in a different way.
There are traces in the global variables: if you call globals() you will see an element with the key _dh, which might help you. Here is how I managed to load the data.csv file that is located in the same folder as my notebook:
import os
current_folder = globals()['_dh'][0]
# Calculating path to the input data
data_location = os.path.join(current_folder,'data.csv')
In modern Python (v3.4+) we can use pathlib to get the notebook's directory:
from pathlib import Path
cwd = Path().resolve()
# cwd == PosixPath('/path/to/this/jupyter/ipynb/file's/directory/')
# or this way, thanks @NunoAndré:
cwd = Path.cwd()
# cwd == PosixPath('/path/to/this/jupyter/ipynb/file's/directory/')
Update
@ShitalShah: I cannot reproduce the error you are reporting. Jupyter notebook seems to work fine regardless of the current working directory from which the application was started.
Example: file ~/dir1/dir2/untitled.ipynb, with Jupyter notebook started in ~/dir1 and again in ~/dir1/dir2 (screenshots of both cases omitted).
It's not possible to get the path to the notebook. You may find a way to get it that only works in one environment (e.g. os.getcwd()), but it won't necessarily work if the notebook is loaded in a different way.
Instead, try to write the notebook so that it doesn't need to know its own path. If you do something like getting the cwd, be sure to fail fast and print an error if it doesn't work, rather than silently continuing.
See also: https://github.com/ipython/ipython/issues/10123
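For instance, a minimal fail-fast check (the expected file name is just an example):
import os

expected = 'data.csv'  # a file the notebook relies on
if not os.path.exists(expected):
    raise FileNotFoundError(
        f'{expected} not found in {os.getcwd()}; '
        'start Jupyter from the notebook directory or fix the path.'
    )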
I'm new to Python, but this works for me.
You can get the equivalent of os.path.dirname(__file__) with:
sys.path[0]
On new versions of Python and notebooks, __file__ works... If you have older versions, you can get __file__ with
import inspect
from pathlib import Path
module_path = Path(inspect.getframeinfo(inspect.currentframe()).filename).resolve()
But... this is much slower. On the other hand, it will not return the current working directory but the path to the module, even if the module is imported from elsewhere.
