How to import library from a given path in Azure Databricks?

How to import library from a given path in Azure Databricks? - databricks

I want to import a standard library (of a given version) in a Databricks Notebook Job. I do not want to install the library everytime a job cluster is created for this job. I want to install the library in a DBFS location and import the library directly from that location (by changing sys.path or something similar).
This is working in local:
I installed a library in a given path using:
pip install --target=customLocation library=major.minor
Append custom Location to sys.path variable:
sys.path.insert(0, 'customLocation')
When I import the library and check the location, I get the expected response
import library
print(library.__file__)
#Output - customLocation\library\__init__.py
However, the same exact sequence does not work in Databricks:
I installed the library in DBFS location:
%pip install --target='dbfs/customFolder/' numpy==1.19.4
Append sys.path variable:
sys.path.insert(0, 'dbfs/customFolder/')
Tried to find numpy version and file location:
import numpy
print(numpy.__version__)
print(numpy.__file__)
#Output - 1.21.4 (Databricks Runtime 10.4 Default Numpy)
/dbfs/customFolder/numpy/__init__.py
The customFolder has version 1.19.4 and the imported numpy shows that location, but version number does not match?
How exactly is the imports working in databricks to create this behaviour?
I also tried importing using the importlib function and the result remains the same. Link - https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly
import importlib.util
import sys
spec = importlib.util.spec_from_file_location('numpy', '/dbfs/customFolder/')
module = importlib.util.module_from_spec(spec)
sys.modules['numpy'] = module
spec.loader.exec_module(module)
import numpy
print(numpy.__version__)
print(numpy.__file__)
Related:
How to install Python Libraries in a DBFS Location and access them through Job Clusters (without installing it for every job)?

Related

azure pyspark from modules import myfunctions; No module name

I have tried a number of methods to import a local script containing an bunch of shared functions from our shared team directory with example in the code below. I also tried "from . import sharedFunctions" with the importing script in the same directory and "from sharedModules import sharedFunctions" from the parent directory. All of these return No module named 'sharedFunctions' based on some google searches. What is the best way to set this up in Azure?
Thanks
import sys, os
dir_path = '/Shared/XXX/sharedModules'
sys.path.insert(0, dir_path)
print(sys.path)
# dir_path = os.path.dirname(os.path.realpath(__file__))
# sys.path.insert(0, dir_path)
import sharedFunctions
sourceTable='dnb_raw'
sourceQuery='select DUNSNumber , GlobalUltimate_Name, BusinessName from'
sourceId = 'DUNSNumber'
sourceNameList=['Tradestyle','BusinessName']
NewTable = 'default.' + sourceTable + '_enhanced'
#dbutils.fs.rm("dbfs:/" + NewTable + "/",recurse=True)
clean_names(sourceTable,sourceQuery,sourceId,sourceNameList)

when you're working with notebooks in Databricks, they are not on some file system that is understood by Python as module.
If you want to include another notebook with some other definitions into the current context, you can use %run magic command, passing the name of another notebook as an argument:
%run /Shared/XXX/sharedModules/sharedFunctions
But the %run is not the full substitution for imports, as described in the documentation
You cannot use %run to run a Python file and import the entities defined in that file into a notebook. To import from a Python file you must package the file into a Python library, create a Databricks library from that Python library, and install the library into the cluster you use to run your notebook.
If you want to execute another notebook to get some results from it, you can use so-called notebook workflow - when exeucting via dbutils.notebook.run, the notebook is scheduled for execution, you can pass some parameters to it, etc., but results will be shared mostly via file system, managed table, etc.

AWS Lambda python: Unable to import module 'lambda_function': No module named 'regex._regex'

I am currently working with AWS Lambda. Here is an excerpt of the code:
import pandas as pd
import re
import nltk
from stop_words import get_stop_words
stopwords = get_stop_words('en')
nltk.download('punkt')
nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()
def lemmatization(txt):
text = ([wn.lemmatize(word) for word in txt])
return text
def lambda_handler(event,context):
bucket = "aaabbb"
key = "cccddd"
s3_client = boto3.client('s3')
s3_file = s3_client.get_object(Bucket=bucket, Key=key)
s3_file_data = s3_file['Body'].read()
s3_file_data = io.BytesIO(s3_file_data)
df = pd.read_csv(s3_file_data)
df['ABC'] = df['ABC'].apply(lambda x: lemmatization(x))
print(df)
However, I am always getting the error:
Unable to import module 'lambda_function': No module named 'regex._regex'
I have already imported nltk and regex packages. Could you please help me with it?

A possible solution could be that your OS uses a different version of Python (i.e. 3.6) when downloading your dependencies than your Lambda Function (i.e. 3.7). I would suggest attempting to download whatever version of Python you are using for your lambda script, and then for example if I wanted the version of Python to be 3.8 I would run the code:
pip3.8 install -r requirements.txt -t aws-lib .

I faced this problem just like you. The problem causing this error is the difference in the OS that you use and lambda function uses. When python installs a package it creates the installed files with respect to the OS you use. So when you use the deployment bundle that was created using a linux OS then it would work with lambda function.
There are many ways for a windows user to do this, but I would recommend using docker containers to install your packages.
Steps to do it:
pull python:3.8 docker image (this is the highest version supported by lambda at the time of writing this answer)
run your container with the directory with the code mounted to the container as volume.
now inside the container navigate to the mounted folder and install your required packages using pip.
come out of your container and now use these installed packages to build your bundle and deploy it on AWS lambda
ps: now when you execute your code on windows it will give an error because the packages installed are built for linux OS

How to use/install python code/file in Juypter notebooks

I got code file as data_load_util.py from Git hub. I'm following some tutorial where this import is being used. Using Python 3.x and Juypter Notebooks with connection to SAP Hana 2.0 Express Edition.
File location - https://github.com/SAP-samples/hana-ml-samples/blob/master/Python-API/pal/notebooks/data_load_utils.py
Command I'm using for tutorial:
from hana_ml import dataframe
from data_load_utils import DataSets, Settings
Error I'm getting:
ModuleNotFoundError: No module named 'data_load_utils'
Since I found this utility data_load_util.py as code file but not sure how I use this or attach this to python or juypter notebooks so that I can use code and this error will be gone.
Help will be appreciated.
Link to error screen shot

You need to tell Jupyter where to look for modules via sys.path.
From this doc, you can add your module’s sub-directory to Python's path like this:
import os
import sys
sys.path.insert(0, os.path.abspath('../module-subdirectory'))
Then you can simply import it:
from data_load_utils import DataSets, Settings
Note: Here module-subdirectory is the sub-directory that has got data_load_util.py.
For alternate methods, please refer this doc.

running user-defined modules from a central directory

I am a beginner in python. I am using python 3.7 via anaconda. I have made certain modules that I want to re-use by importing them in python scripts. Is there a way to centrally store all my modules in one directory and import them when needed?
My actual python scripts may be in different project directories but I want the importable user-defined modules to be stored in a central directory.
Thanks

you can create a virtual environment for maintaining packages for a project.Create a requirement.txt file from virtulenv and installback when needed.
pip install virtualenv
you should activate virtual env before using it.
pip freeze > requirements.txt
pip install -r requirements.txt
Read about virtualenv here https://pypi.org/project/virtualenv/1.7.1.2/#:~:text=You%20can%20install%20virtualenv%20with,it%20with%20python%20virtualenv.py.
Relative import method for your modules:
change your working directory and import modules.
import os
os. chdir('your docs folder path')
from your_modules import *
from module2 import *
add your folder to sys path:
import sys
import os.path
sys.path.append(
os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))
import yourmodule

ImportError: cannot import name 'joblib' from 'sklearn.externals'

I am trying to load my saved model from s3 using joblib
import pandas as pd
import numpy as np
import json
import subprocess
import sqlalchemy
from sklearn.externals import joblib
ENV = 'dev'
model_d2v = load_d2v('model_d2v_version_002', ENV)
def load_d2v(fname, env):
model_name = fname
if env == 'dev':
try:
model=joblib.load(model_name)
except:
s3_base_path='s3://sd-flikku/datalake/doc2vec_model'
path = s3_base_path+'/'+model_name
command = "aws s3 cp {} {}".format(path,model_name).split()
print('loading...'+model_name)
subprocess.call(command)
model=joblib.load(model_name)
else:
s3_base_path='s3://sd-flikku/datalake/doc2vec_model'
path = s3_base_path+'/'+model_name
command = "aws s3 cp {} {}".format(path,model_name).split()
print('loading...'+model_name)
subprocess.call(command)
model=joblib.load(model_name)
return model
But I get this error:
from sklearn.externals import joblib
ImportError: cannot import name 'joblib' from 'sklearn.externals' (C:\Users\prane\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\externals\__init__.py)
Then I tried installing joblib directly by doing
import joblib
but it gave me this error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 8, in load_d2v_from_s3
File "/home/ec2-user/.local/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 585, in load
obj = _unpickle(fobj, filename, mmap_mode)
File "/home/ec2-user/.local/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 504, in _unpickle
obj = unpickler.load()
File "/usr/lib64/python3.7/pickle.py", line 1088, in load
dispatch[key[0]](self)
File "/usr/lib64/python3.7/pickle.py", line 1376, in load_global
klass = self.find_class(module, name)
File "/usr/lib64/python3.7/pickle.py", line 1426, in find_class
__import__(module, level=0)
ModuleNotFoundError: No module named 'sklearn.externals.joblib'
Can you tell me how to solve this?

You should directly use
import joblib
instead of
from sklearn.externals import joblib

It looks like your existing pickle save file (model_d2v_version_002) encodes a reference module in a non-standard location – a joblib that's in sklearn.externals.joblib rather than at top-level.
The current scikit-learn documentation only talks about a top-level joblib – eg in 3.4.1 Persistence example – but I do see a reference in someone else's old issue to a DeprecationWarning in scikit-learn version 0.21 about an older scikit.external.joblib variant going away:
Python37\lib\site-packages\sklearn\externals\joblib_init_.py:15:
DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and
will be removed in 0.23. Please import this functionality directly
from joblib, which can be installed with: pip install joblib. If this
warning is raised when loading pickled models, you may need to
re-serialize those models with scikit-learn 0.21+.
'Deprecation' means marking something as inadvisable to rely-upon, as it is likely to be discontinued in a future release (often, but not always, with a recommended newer way to do the same thing).
I suspect your model_d2v_version_002 file was saved from an older version of scikit-learn, and you're now using scikit-learn (aka sklearn) version 0.23+ which has totally removed the sklearn.external.joblib variation. Thus your file can't be directly or easily loaded to your current environment.
But, per the DeprecationWarning, you can probably temporarily use an older scikit-learn version to load the file the old way once, then re-save it with the now-preferred way. Given the warning info, this would probably require scikit-learn version 0.21.x or 0.22.x, but if you know exactly which version your model_d2v_version_002 file was saved from, I'd try to use that. The steps would roughly be:
create a temporary working environment (or roll back your current working environment) with the older sklearn
do imports something like:
import sklearn.external.joblib as extjoblib
import joblib
extjoblib.load() your old file as you'd planned, but then immediately re-joblib.dump() the file using the top-level joblib. (You likely want to use a distinct name, to keep the older file around, just in case.)
move/update to your real, modern environment, and only import joblib (top level) to use joblib.load() - no longer having any references to `sklearn.external.joblib' in either your code, or your stored pickle files.

You can import joblib directly by installing it as a dependency and using import joblib,
Documentation.

Maybe your code is outdated. For anyone who aims to use fetch_mldata in digit handwritten project, you should fetch_openml instead. (link)
In old version of sklearn:
from sklearn.externals import joblib
mnist = fetch_mldata('MNIST original')
In sklearn 0.23 (stable release):
import sklearn.externals
import joblib
dataset = datasets.fetch_openml("mnist_784")
features = np.array(dataset.data, 'int16')
labels = np.array(dataset.target, 'int')
For more info about deprecating fetch_mldata see scikit-learn doc

none of the answers below works for me, with a little changes this modification was ok for me
import sklearn.externals as extjoblib
import joblib

for this error, I had to directly use the following and it worked like a charm:
import joblib
Simple

In case the execution / call to joblib is within another .py program instead of your own (in such case even you have installed joblib, it still causes error from within the calling python programme unless you change the code, i thought would be messy), I tried to create a hardlink:
(windows version)
Python> import joblib
then inside your sklearn path >......\Lib\site-packages\sklearn\externals
mklink /J ./joblib .....\Lib\site-packages\joblib
(you can work out the above using a ! or %, !mklink....... or %mklink...... inside your Python juptyter notebook , or use python OS command...)
This effectively create a virtual folder of joblib within the "externals" folder
Remarks:
Of course to be more version resilient, your code has to check for the version of sklearn is >= 0.23 again before hand.
This would be alternative to changing sklearn vesrion.

When getting error:
from sklearn.externals import joblib it deprecated older version.
For new version follow:
conda install -c anaconda scikit-learn (install using "Anaconda Promt")
import joblib (Jupyter Notebook)

I had the same problem
What I did not realize was that joblib was already installed!
so what you have to do is replace
from sklearn.externals import joblib
with
import joblib
and that is it

After a long investigation, given my computer setup, I've found that was because an SSL certificate was required to download the dataset.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to import library from a given path in Azure Databricks? - databricks

Related

azure pyspark from modules import myfunctions; No module name

AWS Lambda python: Unable to import module 'lambda_function': No module named 'regex._regex'

How to use/install python code/file in Juypter notebooks

running user-defined modules from a central directory

ImportError: cannot import name 'joblib' from 'sklearn.externals'

Categories

Resources