How to use/install python code/file in Jupyter notebooks - python-3.x

I got the code file data_load_utils.py from GitHub. I'm following a tutorial where this import is used. I'm using Python 3.x and Jupyter Notebooks with a connection to SAP HANA 2.0 Express Edition.
File location - https://github.com/SAP-samples/hana-ml-samples/blob/master/Python-API/pal/notebooks/data_load_utils.py
Commands I'm using from the tutorial:
from hana_ml import dataframe
from data_load_utils import DataSets, Settings
Error I'm getting:
ModuleNotFoundError: No module named 'data_load_utils'
I found this utility as the code file data_load_utils.py, but I'm not sure how to use it or attach it to Python or Jupyter notebooks so that I can use the code and this error goes away.
Help will be appreciated.

You need to tell Jupyter where to look for modules via sys.path.
From this doc, you can add your module’s sub-directory to Python's path like this:
import os
import sys
sys.path.insert(0, os.path.abspath('../module-subdirectory'))
Then you can simply import it:
from data_load_utils import DataSets, Settings
Note: Here module-subdirectory is the sub-directory that contains data_load_utils.py.
For alternate methods, please refer to this doc.
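For example, if data_load_utils.py is saved in a folder named utils next to your notebook (a hypothetical layout; adjust the path to wherever you placed the file), the full cell could look like this:
import os
import sys

# point Python at the folder that holds data_load_utils.py
sys.path.insert(0, os.path.abspath('./utils'))

from data_load_utils import DataSets, Settings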

Related

How to import library from a given path in Azure Databricks?

I want to import a standard library (of a given version) in a Databricks notebook job. I do not want to install the library every time a job cluster is created for this job. I want to install the library in a DBFS location and import it directly from that location (by changing sys.path or something similar).
This is working locally:
I installed a library in a given path using:
pip install --target=customLocation library==major.minor
Append the custom location to the sys.path variable:
sys.path.insert(0, 'customLocation')
When I import the library and check its location, I get the expected response:
import library
print(library.__file__)
#Output - customLocation\library\__init__.py
However, the same exact sequence does not work in Databricks:
I installed the library in a DBFS location:
%pip install --target='dbfs/customFolder/' numpy==1.19.4
Append to the sys.path variable:
sys.path.insert(0, 'dbfs/customFolder/')
Then I checked the numpy version and file location:
import numpy
print(numpy.__version__)
print(numpy.__file__)
#Output - 1.21.4 (Databricks Runtime 10.4 Default Numpy)
#         /dbfs/customFolder/numpy/__init__.py
The customFolder has version 1.19.4, and the imported numpy points to that location, yet the version number does not match.
How exactly do imports work in Databricks to create this behaviour?
I also tried importing using importlib, and the result remains the same. Link - https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly
import importlib.util
import sys
spec = importlib.util.spec_from_file_location('numpy', '/dbfs/customFolder/')
module = importlib.util.module_from_spec(spec)
sys.modules['numpy'] = module
spec.loader.exec_module(module)
import numpy
print(numpy.__version__)
print(numpy.__file__)
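For reference, spec_from_file_location expects the path of a module file rather than a directory, so a variant of the snippet above pointed at the package's __init__.py (hypothetical DBFS path) would look like this; in Databricks the result may still differ, as described above:
import importlib.util
import sys

# load the package from its __init__.py instead of the containing folder
spec = importlib.util.spec_from_file_location('numpy', '/dbfs/customFolder/numpy/__init__.py')
module = importlib.util.module_from_spec(spec)
sys.modules['numpy'] = module
spec.loader.exec_module(module)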
Related:
How to install Python Libraries in a DBFS Location and access them through Job Clusters (without installing it for every job)?

azure pyspark from modules import myfunctions; No module named

Based on some Google searches, I have tried a number of methods to import a local script containing a bunch of shared functions from our shared team directory; an example is in the code below. I also tried "from . import sharedFunctions" with the importing script in the same directory, and "from sharedModules import sharedFunctions" from the parent directory. All of these return No module named 'sharedFunctions'. What is the best way to set this up in Azure?
Thanks
import sys, os
dir_path = '/Shared/XXX/sharedModules'
sys.path.insert(0, dir_path)
print(sys.path)
# dir_path = os.path.dirname(os.path.realpath(__file__))
# sys.path.insert(0, dir_path)
import sharedFunctions
sourceTable='dnb_raw'
sourceQuery='select DUNSNumber , GlobalUltimate_Name, BusinessName from'
sourceId = 'DUNSNumber'
sourceNameList=['Tradestyle','BusinessName']
NewTable = 'default.' + sourceTable + '_enhanced'
#dbutils.fs.rm("dbfs:/" + NewTable + "/",recurse=True)
clean_names(sourceTable,sourceQuery,sourceId,sourceNameList)
When you're working with notebooks in Databricks, they are not on a file system that Python understands as importable modules.
If you want to include another notebook with some other definitions into the current context, you can use %run magic command, passing the name of another notebook as an argument:
%run /Shared/XXX/sharedModules/sharedFunctions
But %run is not a full substitute for imports, as described in the documentation:
You cannot use %run to run a Python file and import the entities defined in that file into a notebook. To import from a Python file you must package the file into a Python library, create a Databricks library from that Python library, and install the library into the cluster you use to run your notebook.
If you want to execute another notebook to get some results from it, you can use the so-called notebook workflow: when executing via dbutils.notebook.run, the notebook is scheduled for execution, you can pass parameters to it, etc., but results will be shared mostly via the file system, managed tables, etc.
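For example (a minimal sketch; the notebook path, timeout and parameters are placeholders):
result = dbutils.notebook.run('/Shared/XXX/sharedModules/sharedFunctions', 60, {'sourceTable': 'dnb_raw'})
print(result)  # whatever the called notebook returns via dbutils.notebook.exit(...)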

NLTK is called and got error of "punkt" not found on databricks pyspark

I would like to call NLTK to do some NLP on databricks by pyspark.
I have installed NLTK from the library tab of databricks. It should be accessible from all nodes.
My py3 code:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import nltk
nltk.download('punkt')
def get_keywords1(col):
    sentences = []
    sentence = nltk.sent_tokenize(col)

get_keywords_udf = F.udf(get_keywords1, StringType())
I ran the above code and got:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
When I run the following code:
t = spark.createDataFrame(
    [(2010, 1, 'rdc', 'a book'), (2010, 1, 'rdc', 'a car'),
     (2007, 6, 'utw', 'a house'), (2007, 6, 'utw', 'a hotel')],
    ("year", "month", "u_id", "objects"))

t1 = t.withColumn('keywords', get_keywords_udf('objects'))
t1.show()  # error here!
I got this error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/databricks/python/nltk_data'
- '/databricks/python/share/nltk_data'
- '/databricks/python/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
I have downloaded 'punkt'. It is located at
/root/nltk_data/tokenizers
I have updated the PATH in spark environment with the folder location.
Why can it not be found?
I tried the solutions at NLTK. Punkt not found and How to config nltk data directory from code?, but none of them work for me.
I have tried updating the path with
nltk.data.path.append('/root/nltk_data/tokenizers/')
but it does not work. It seems that nltk cannot see the newly added path!
I also copied punkt to the path where nltk searches:
cp -r /root/nltk_data/tokenizers/punkt /root/nltk_data
but nltk still cannot see it.
thanks
When spinning up a Databricks single-node cluster this will work fine: installing nltk via pip and then using nltk.download() to get the prebuilt models/text works.
Assumptions: User is programming in a Databricks notebook with python as the default language.
When spinning up a multinode cluster there are a couple of issues you will run into.
You are registering a UDF that relies on code from another module. In order for this UDF to work on every node in the cluster, the module needs to be installed at the cluster level (i.e. nltk installed on the driver and all worker nodes). The module can be installed via an init script at cluster start time or via the libraries section in the Databricks Compute section. More on that here (I also give code examples below):
https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries
Now when you run the UDF the module will exist on all nodes of the cluster.
Using nltk.download() to get data that the module references. When we do nltk.download() in a multinode cluster interactively, it will only download to the driver node. So when your UDF executes on the other nodes, those nodes will not contain the needed references in the paths it looks in by default. To see these default paths, run nltk.data.path.
To overcome this there are two possibilities I have explored. One of them works.
(doesn't work) Using an init script, install nltk, then in that same init script call nltk.download via a one-liner bash python expression after the install, like:
python -c "import nltk; nltk.download('all')"
I have run into the issue where nltk is installed but not found after the install. I'm assuming virtual environments are playing a role here.
(works) Using an init script, install nltk.
Create the script
dbutils.fs.put('dbfs:/databricks/scripts/nltk-install.sh', """
#!/bin/bash
pip install nltk""", True)
Check it out
%sh
head '/dbfs/databricks/scripts/nltk-install.sh'
Configure cluster to run init script on start up
Databricks Cluster Init Script Config
In the cluster configuration create the environment variable NLTK_DATA="/dbfs/databricks/nltk_data/". This is used by the nltk package to search for data/model dependencies.
Databricks Cluster Env Variable Config
Start the cluster.
After it is installed and the cluster is running, check to make sure the environment variable was correctly created.
import os
os.environ.get("NLTK_DATA")
Then check to make sure that nltk is pointing towards the correct paths.
import nltk
nltk.data.path
If '/dbfs/databricks/nltk_data/' is within the list, we are good to go.
Download the stuff you need.
nltk.download('all', download_dir="/dbfs/databricks/nltk_data/")
Notice that we downloaded the dependencies to Databricks storage. Now every node will have access to the nltk default dependencies. Because we specified the environment variable NLTK_DATA at cluster creation time, when we import nltk it will look in that directory. The only difference here is that we have now pointed nltk to our Databricks storage, which is accessible by every node.
Since the data exists in mounted storage at cluster start-up, we shouldn't need to re-download the data every time.
After following these steps you should be all set to play with nltk and all of its default data/models.
I recently encountered the same issue when using NLTK in a Glue job.
Adding the 'missing' file to all nodes resolved the issue for me. I'm not sure if it will help in Databricks, but it is worth a shot.
sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
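If you want to confirm where the distributed copy lands on each worker, SparkFiles can resolve the per-node path (a sketch; whether NLTK then picks it up depends on the directory layout it expects):
from pyspark import SparkFiles

local_path = SparkFiles.get('english.pickle')  # per-node location of the file added via sc.addFile
print(local_path)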
Drew Ringo's suggestion almost worked for me.
If you're using a multinode cluster in Databricks, you will face the problems Ringo mentioned. For me a much simpler solution was running the following init_script:
dbutils.fs.put("dbfs:/databricks/scripts/nltk_punkt.sh", """#!/bin/bash
pip install nltk
python -m nltk.downloader punkt""",True)
Make sure to add the filepath under Advanced options -> Init Scripts found within the Cluster Configuration menu.
The first of Drew Ringo's 2 possibilities will work if your cluster's init_script looks like this:
%sh
/databricks/python/bin/pip install nltk
/databricks/python/bin/python -m nltk.downloader punkt
He is correct to assume that his original issue relates to virtual environments.
This helped me to solve the issue:
import nltk
nltk.download('all')

AttributeError when packaging python library

I'm in the process of learning how to package a python library using the official guide. I've started by cloning the minimal sample package suggested in the guide here. I've then added the file my_module.py inside the folder sampleproject, storing a simple power function. Another function is also stored in /sampleproject/sampleproject/__init__.py. The resulting structure of the library is the following
Finally, I've used pip to successfully install the package in the interpreter. The only thing left is to make sure that I'm able to run the functions stored in the subfolder sampleproject.
import sampleproject
sampleproject.main()
# Output
"Call your main application code here"
This is great. The package is able to run the function in __init__.py. However, the package is not able to find module.py:
import sampleproject
sampleproject.module
# Output
AttributeError: module 'sampleproject' has no attribute 'module'
I've tried to add __init__.py in the main folder and to change the entry_points settings in setup.py, without success. What should I do to let sampleproject find the function in module.py?
Is your sampleproject.module a function you would like to execute?
In that case, do as for sampleproject and add () to execute it:
sampleproject.module()
Otherwise, you can import your package like this:
import sampleproject.module
or:
from sampleproject import module
To be clearer, you would have to import module in your sampleproject __init__.py. Then, when you want to use the package, import it (in some .py file at the project root):
import sampleproject # is enough as it's going to import everything you stated in __init__.py
After that, you can start to use what's in the package you imported, for example module() if you have a function called module in your package.
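For example, assuming the module really lives at sampleproject/module.py (adjust the name if it is my_module.py in your tree), a minimal __init__.py re-export could look like this:
# sampleproject/__init__.py
from . import module  # makes sampleproject.module resolvable after "import sampleproject"
After reinstalling the package with pip, sampleproject.module should no longer raise the AttributeError.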
__init__.py discussions
It seems you are in sampleproject -> module.py, so you need to try:
from sampleproject import module

Python 3: No module named zlib?

I am trying to run my Flask application with an Apache server using mod_wsgi, and it has been a bumpy road, to say the least.
It has been suggested that I should try to run my app's .wsgi file using Python to make sure it is working.
This is the contents of the file:
#!/usr/bin/python
activate_this = '/var/www/Giveaway/Giveaway/venv/bin/activate_this.py'
with open(activate_this) as f:
    code = compile(f.read(), "somefile.py", 'exec')
    exec(code)
import sys
import logging
logging.basicConfig(stream=sys.stderr)
sys.path.insert(0,"/var/www/Giveaways/")
from Giveaways import application
application.secret_key = 'Add your secret key'
However, when I run it, I get this error:
ImportError: No module named 'zlib'
And no, I am not using some homebrewed version of Python - I installed Python 3 via apt-get.
Thanks for any help.
Does the content of somefile.py include the gzip package? In that case you may have to install the gzip package via pip or similar.
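One quick way to sanity-check whether the system Python itself can see zlib (a minimal check; if it prints a version, the error is more likely coming from the interpreter or virtualenv that mod_wsgi actually uses):
python3 -c "import zlib; print(zlib.ZLIB_VERSION)"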
