Using pyspark with pybuilder - python-3.x

We are setting up pybuilder on a new big data project. We have to test that some classes build the correct distributed tables, so we built a few unit tests that pass when run from eclipse/pydev.
I can run independent unit tests successfully, but when I add the one using pyspark I get a long list of Java exceptions starting with:
ERROR Utils:91 - Aborting task
ExitCodeException exitCode=-1073741515:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
This is my build.py file:
from pybuilder.core import use_plugin
from pybuilder.core import init
import sys
import os
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python', 'lib', 'py4j-0.10.7-src.zip'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python'))
use_plugin("python.core")
use_plugin("python.unittest")
use_plugin("python.install_dependencies")
default_task = "publish"
We are using pyspark 2.3.1 and Python 3.7.
What am I doing wrong?

The solution for me was to execute winutils chmod 777 -R in my workspace after installing the Microsoft Visual C++ 2010 Redistributable Package.

Related

Debug PySpark in VS Code

I'm building a project in PySpark using VS Code. I have Python and Spark installed, and PySpark imports and runs correctly in a Jupyter notebook. To do so, I run:
import findspark
findspark.init()
import pyspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
[my code... ]
Now, how do I debug my PySpark code in VS Code? I don't want to run findspark in my project. Do I need to create a virtual environment and run a pre-script? What are the best practices?
So the steps I followed and what I learnt:
Make sure Python, Spark and the environment variables are correctly installed and set.
Create a virtual environment, and pip install all required libraries (including pyspark).
Run the debugger. The debugger will just analyze the code and not run it, so a SparkContext and a SparkSession are not required.

How to import library from a given path in Azure Databricks?

I want to import a standard library (of a given version) in a Databricks notebook job. I do not want to install the library every time a job cluster is created for this job. I want to install the library in a DBFS location and import it directly from that location (by changing sys.path or something similar).
This is working in local:
I installed a library in a given path using:
pip install --target=customLocation library==major.minor
Append custom Location to sys.path variable:
sys.path.insert(0, 'customLocation')
When I import the library and check the location, I get the expected response
import library
print(library.__file__)
#Output - customLocation\library\__init__.py
However, the same exact sequence does not work in Databricks:
I installed the library in DBFS location:
%pip install --target='dbfs/customFolder/' numpy==1.19.4
Append sys.path variable:
sys.path.insert(0, 'dbfs/customFolder/')
Tried to find numpy version and file location:
import numpy
print(numpy.__version__)
print(numpy.__file__)
#Output - 1.21.4 (Databricks Runtime 10.4 Default Numpy)
/dbfs/customFolder/numpy/__init__.py
The customFolder holds version 1.19.4, and the imported numpy reports that location, yet the version number does not match.
How exactly do imports work in Databricks to create this behaviour?
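One plausible explanation for the stale version number (an assumption, not verified against Databricks internals) is Python's import cache: once a module has been imported, later sys.path changes are ignored until the cache entry is evicted. A stdlib-only sketch, with a hypothetical module named demo standing in for the two numpy installs:

```python
import os
import sys
import tempfile

# Build two directories that each provide a module named `demo`
# with a different __version__ (stand-ins for two numpy installs).
site_a = tempfile.mkdtemp()
site_b = tempfile.mkdtemp()
for path, version in [(site_a, "1.0"), (site_b, "2.0")]:
    with open(os.path.join(path, "demo.py"), "w") as f:
        f.write("__version__ = '%s'\n" % version)

sys.path.insert(0, site_a)
import demo
assert demo.__version__ == "1.0"

# Inserting a "better" path afterwards changes nothing: the module
# is already cached in sys.modules.
sys.path.insert(0, site_b)
import demo
assert demo.__version__ == "1.0"

# Only evicting the cache entry makes the next import honor sys.path.
del sys.modules["demo"]
import demo
assert demo.__version__ == "2.0"
```

If the runtime's numpy was imported (or its site-packages consulted) before the sys.path.insert, the preinstalled version would win in exactly this way.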
I also tried importing with importlib, and the result remains the same. Link - https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly
import importlib.util
import sys
spec = importlib.util.spec_from_file_location('numpy', '/dbfs/customFolder/')
module = importlib.util.module_from_spec(spec)
sys.modules['numpy'] = module
spec.loader.exec_module(module)
import numpy
print(numpy.__version__)
print(numpy.__file__)
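For the importlib route, spec_from_file_location expects the package's __init__.py file rather than a directory, plus submodule_search_locations so subpackages keep resolving. A sketch with a throwaway package (mypkg is a hypothetical stand-in for the pinned library):

```python
import importlib.util
import os
import sys
import tempfile

# Create a throwaway package to load.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "mypkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("__version__ = '1.19.4'\n")

# Point the spec at __init__.py, not the folder, and give it
# submodule_search_locations so `import mypkg.sub` would also work.
init = os.path.join(pkg, "__init__.py")
spec = importlib.util.spec_from_file_location(
    "mypkg", init, submodule_search_locations=[pkg])
module = importlib.util.module_from_spec(spec)
sys.modules["mypkg"] = module
spec.loader.exec_module(module)

import mypkg
assert mypkg.__version__ == "1.19.4"
```

Passing the bare directory (as in the snippet above) produces a spec with no usable loader, which is consistent with the fallback to the already-cached runtime numpy.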
Related:
How to install Python Libraries in a DBFS Location and access them through Job Clusters (without installing it for every job)?

Azure DataBrick PYTHONPATH pointing at imported wheel?

I successfully created a python wheel of my python project following simple steps from here: https://python101.pythonlibrary.org/chapter39_wheels.html
Then from my Databricks notebook, I installed my project dependencies (I separately uploaded my project's requirements.txt to my blob storage):
%pip install -r /dbfs/mnt/testdb-blob-container1/requirements.txt
I then uploaded the python wheel of my project via the Azure Databricks interface: https://learn.microsoft.com/en-us/azure/databricks/libraries/workspace-libraries
From my Databricks notebook, I can successfully reference:
import myproject
import myproject.src
from myproject.src.core import constants as constants <-- This is fine.
But this blew up, because my datetimeutil needs "constants". Locally we have PYTHONPATH; in Databricks we don't, so the attempt below to import datetimeutil blew up:
from myproject.src.helpers import datetimeutil as datetimeutil
How do we set PYTHONPATH in the Databricks environment?
One thing I tried: my wheel file is here:
dbfs:/FileStore/jars/23011937_5e16_4be0_b82a_88e83aaecadf/myproject-1.0-py3-none-any.whl
From my notebook:
import sys
sys.path.append("dbfs:/FileStore/jars/23011937_5e16_4be0_b82a_88e83aaecadf/")
This did not work.
Thanks
Navigate to your cluster > Libraries and click the Install New button.
With Library Source = Upload and Library Type = Python Whl, drag and drop the whl file. This installs the custom python library on the Databricks cluster.
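As for why the sys.path.append of the dbfs:/ URI did not work: sys.path entries must be paths the path importers understand, and a URI is not one (on Databricks the same files are usually reachable through the /dbfs FUSE mount). A stdlib sketch of the difference, with a hypothetical module wheelmod standing in for the wheel contents:

```python
import os
import sys
import tempfile

# A directory containing an importable module.
d = tempfile.mkdtemp()
with open(os.path.join(d, "wheelmod.py"), "w") as f:
    f.write("VALUE = 42\n")

# A URI-style entry is silently ignored by the path importers,
# just like the dbfs:/FileStore/... entry above.
sys.path.append("dbfs:" + d)
try:
    import wheelmod
    found_with_uri = True
except ImportError:
    found_with_uri = False
assert not found_with_uri

# A plain filesystem path works.
sys.path.append(d)
import wheelmod
assert wheelmod.VALUE == 42
```

Since a .whl is a zip archive, appending the wheel file's own filesystem path to sys.path can also work for pure-python wheels via zipimport, though installing it as a cluster library (as in the answer above) is the supported route.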

NLTK is called and got error of "punkt" not found on databricks pyspark

I would like to call NLTK to do some NLP on Databricks via pyspark.
I have installed NLTK from the library tab of Databricks. It should be accessible from all nodes.
My py3 code:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import nltk
nltk.download('punkt')
def get_keywords1(col):
    sentences = []
    sentence = nltk.sent_tokenize(col)

get_keywords_udf = F.udf(get_keywords1, StringType())
I run the above code and got:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
When I run the following code:
t = spark.createDataFrame(
    [(2010, 1, 'rdc', 'a book'), (2010, 1, 'rdc', 'a car'),
     (2007, 6, 'utw', 'a house'), (2007, 6, 'utw', 'a hotel')],
    ("year", "month", "u_id", "objects"))
t1 = t.withColumn('keywords', get_keywords_udf('objects'))
t1.show() # error here !
I got error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/databricks/python/nltk_data'
- '/databricks/python/share/nltk_data'
- '/databricks/python/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
I have downloaded 'punkt'. It is located at
/root/nltk_data/tokenizers
I have updated the PATH in spark environment with the folder location.
Why can it not be found?
I tried the solutions from "NLTK. Punkt not found" and "How to config nltk data directory from code?",
but none of them work for me.
I have tried updating
nltk.data.path.append('/root/nltk_data/tokenizers/')
but it does not work.
It seems that nltk cannot see the newly added path!
I also copied punkt to the path where nltk searches:
cp -r /root/nltk_data/tokenizers/punkt /root/nltk_data
but nltk still cannot see it.
Thanks
When spinning up a Databricks single node cluster this will work fine. Installing nltk via pip and then using the nltk.download module to get the prebuilt models/text works.
Assumptions: User is programming in a Databricks notebook with python as the default language.
When spinning up a multinode cluster there are a couple of issues you will run into.
You are registering a UDF that relies on code from another module. In order for this UDF to work on every node in the cluster, the module needs to be installed at the cluster level (i.e. nltk installed on the driver and all worker nodes). The module can be installed via an init script at cluster start time or via the Libraries section of the Databricks Compute page. More on that here (I also give code examples below):
https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries
Now when you run the UDF the module will exist on all nodes of the cluster.
Using nltk.download() to get data that the module references. When we call nltk.download() interactively in a multinode cluster, it only downloads to the driver node. So when your UDF executes on the other nodes, those nodes will not find the needed resources in the paths it searches by default. To see these default paths, run nltk.data.path.
To overcome this there are two possibilities I have explored. One of them works.
(doesn't work) Using an init script, install nltk, then in that same init script call nltk.download via a one-liner bash python expression after the install, like:
python -c "import nltk; nltk.download('all')"
I have run into the issue where nltk is installed but not found afterwards. I'm assuming virtual environments are playing a role here.
(works) Using an init script, install nltk.
Create the script
dbutils.fs.put('/dbfs/databricks/scripts/nltk-install.sh', """
#!/bin/bash
pip install nltk""", True)
Check it out
%sh
head '/dbfs/databricks/scripts/nltk-install.sh'
Configure cluster to run init script on start up
Databricks Cluster Init Script Config
In the cluster configuration create the environment variable NLTK_DATA="/dbfs/databricks/nltk_data/". This is used by the nltk package to search for data/model dependencies.
Databricks Cluster Env Variable Config
Start the cluster.
After it is installed and the cluster is running, check to make sure the environment variable was correctly created.
import os
os.environ.get("NLTK_DATA")
Then check to make sure that nltk is pointing towards the correct paths.
import nltk
nltk.data.path
If '/dbfs/databricks/nltk_data/' is in the list, we are good to go.
Download the stuff you need.
nltk.download('all', download_dir="/dbfs/databricks/nltk_data/")
Notice that we downloaded the dependencies to Databricks storage, so every node has access to the default nltk data. Because we set the environment variable NLTK_DATA at cluster creation time, nltk will look in that directory when imported; the only difference is that we pointed it at Databricks storage, which every node can reach.
And since the data lives in mounted storage, it does not need to be re-downloaded at every cluster start.
After following these steps you should be all set to play with nltk and all of its default data/models.
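The NLTK_DATA variable described above works like any search-path lookup: directories from the variable are consulted before the built-in defaults. A simplified mimic of that idea (this is not NLTK's actual code, and the default directories shown are just two of the paths from the error message above):

```python
import os

def data_search_path(env=None,
                     defaults=("/usr/share/nltk_data",
                               "/usr/local/share/nltk_data")):
    """Simplified mimic of an NLTK_DATA-style lookup order:
    directories from the env var come first, then the defaults."""
    env = os.environ if env is None else env
    paths = []
    raw = env.get("NLTK_DATA", "")
    if raw:
        # Like PATH, the variable may hold several os.pathsep-separated dirs.
        paths.extend(p for p in raw.split(os.pathsep) if p)
    paths.extend(defaults)
    return paths

# With the cluster config above, the shared DBFS directory is
# consulted before any node-local default.
fake_env = {"NLTK_DATA": "/dbfs/databricks/nltk_data/"}
assert data_search_path(fake_env)[0] == "/dbfs/databricks/nltk_data/"
```

This is why setting the variable at cluster creation time matters: every worker process inherits it and resolves punkt from the shared location first.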
I recently encountered the same issue when using NLTK in a Glue job.
Adding the 'missing' file to all nodes resolved the issue for me. I'm not sure if it will help in Databricks, but it's worth a shot.
sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
Drew Ringo's suggestion almost worked for me.
If you're using a multinode cluster in Databricks, you will face the problems Ringo mentioned. For me a much simpler solution was running the following init script:
dbutils.fs.put("dbfs:/databricks/scripts/nltk_punkt.sh", """#!/bin/bash
pip install nltk
python -m nltk.downloader punkt""",True)
Make sure to add the file path under Advanced options -> Init Scripts in the Cluster Configuration menu.
The first of Drew Ringo's two possibilities will work if your cluster's init script looks like this:
%sh
/databricks/python/bin/pip install nltk
/databricks/python/bin/python -m nltk.downloader punkt
He is correct to assume that his original issue relates to virtual environments.
This helped me to solve the issue:
import nltk
nltk.download('all')

SparkSession | Ubuntu | Pycharm not working

I'm trying to use PySpark locally on Ubuntu with PyCharm rather than a jupyter notebook, in order to build an Electron app. However, when I try to set up a SparkSession, it doesn't work. When I try this:
spark = SparkSession.builder.master('local[*]').appName('Search').enableHiveSupport().getOrCreate
df = pd.DataFrame([1,2,3], columns=['Test'])
myschema = StructType([StructField('Test', IntegerType(), True)])
df2 = spark.createDataFrame(df,schema=myschema)
print(type(df2))
the session opens, but it tells me:
"AttributeError: 'function' object has no attribute 'createDataFrame'"
Then, rewriting the above with ".getOrCreate()", it tells me:
"FileNotFoundError: [Errno 2] No such file or directory: 'home/...././bin/spark-submit'"
I guess the setup in PyCharm might be off, but I don't really understand why.
You need to use the method invocation getOrCreate(), not getOrCreate. Also, make sure you install pyspark inside the python interpreter used for your project in PyCharm. You can access it via Preferences -> Python Interpreter in PyCharm.
Update:
Try downloading and extracting the Spark binaries (e.g. Spark 2.4.0) on your local machine, then add the following entries to your bashrc (and source it). I'm assuming you're using Spark 2.4.0, so the py4j version below matches that release; for any other Spark version, check the bundled py4j version and adjust accordingly.
export SPARK_HOME=/<your_path>/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=${SPARK_HOME}/python:$PYTHONPATH
export PYTHONPATH=${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=/<location_of_python_interpreter>
Whatever python interpreter you link in PYSPARK_PYTHON, make sure to use the same python in your PyCharm project.
