How to provide a matplotlib Python dependency to spark-submit? - apache-spark

I have inherited a web application that delegates some of its work to other machines via spark-submit.
spark-submit -v --py-files /somedir/python/pyspark_utils*.egg --master yarn /somedir/python/driver.py arg1 gwas-manhattan-plot 58
Our code relies on matplotlib for plotting, so we build using setuptools with
python setup.py bdist_egg.
The setuptools web page says:
When your project is installed (e.g., using pip), all of the
dependencies not already installed will be located (via PyPI),
downloaded, built (if necessary), and installed...
Within the generated pyspark_utils-0.0.1-py3.9.egg is a requires.txt file with the contents:
matplotlib
numpy
Unfortunately, when I run the spark-submit command, it can't find matplotlib:
ModuleNotFoundError: No module named 'matplotlib'
While the egg file indicates a dependency on matplotlib, it doesn't actually include the packaged code. How can I indicate to spark-submit where to find matplotlib? I thought of adding another --py-files argument for matplotlib, but matplotlib is available as a wheel, not as an egg file, and as I understand it spark-submit can't handle wheels.
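One commonly documented approach (a sketch following the PySpark "Python Package Management" guide, not verified against this exact setup) is to pack a virtual environment that contains matplotlib and ship it to the cluster with --archives, pointing the workers' Python at the unpacked environment. The environment name and paths below are illustrative:
python -m venv pyspark_env
source pyspark_env/bin/activate
pip install matplotlib numpy venv-pack
venv-pack -o pyspark_env.tar.gz
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --master yarn \
  --archives pyspark_env.tar.gz#environment \
  --py-files /somedir/python/pyspark_utils*.egg \
  /somedir/python/driver.py arg1 gwas-manhattan-plot 58
The #environment suffix sets the directory name the archive is unpacked into on each node, so ./environment/bin/python resolves to the shipped interpreter that has matplotlib installed.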

Related

NLTK is called and got error of "punkt" not found on databricks pyspark

I would like to call NLTK to do some NLP on Databricks with PySpark.
I have installed NLTK from the Library tab of Databricks. It should be accessible from all nodes.
My Python 3 code:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import nltk

nltk.download('punkt')

def get_keywords1(col):
    # tokenize the input text into sentences
    sentences = nltk.sent_tokenize(col)
    return str(sentences)

get_keywords_udf = F.udf(get_keywords1, StringType())
I run the above code and got:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
When I run the following code:
t = spark.createDataFrame(
    [(2010, 1, 'rdc', 'a book'), (2010, 1, 'rdc', 'a car'),
     (2007, 6, 'utw', 'a house'), (2007, 6, 'utw', 'a hotel')],
    ("year", "month", "u_id", "objects"))
t1 = t.withColumn('keywords', get_keywords_udf('objects'))
t1.show()  # error here!
I got this error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/databricks/python/nltk_data'
- '/databricks/python/share/nltk_data'
- '/databricks/python/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
I have downloaded 'punkt'. It is located at
/root/nltk_data/tokenizers
I have updated the PATH in the Spark environment with the folder location.
Why can't it be found?
I have tried the solutions from NLTK. Punkt not found and How to config nltk data directory from code?, but none of them work for me.
I have tried updating the search path with
nltk.data.path.append('/root/nltk_data/tokenizers/')
but it does not work; it seems that nltk cannot see the newly added path!
I also copied punkt to the path where nltk searches:
cp -r /root/nltk_data/tokenizers/punkt /root/nltk_data
but nltk still cannot see it.
Thanks
When spinning up a Databricks single-node cluster this will work fine: installing nltk via pip and then using nltk.download to get the prebuilt models/text works.
Assumption: the user is programming in a Databricks notebook with Python as the default language.
When spinning up a multi-node cluster there are a couple of issues you will run into.
You are registering a UDF that relies on code from another module. In order for this UDF to work on every node in the cluster, the module needs to be installed at the cluster level (i.e. nltk installed on the driver and all worker nodes). The module can be installed via an init script at cluster start time or via the Libraries section in the Databricks Compute section. More on that here (I also give code examples below):
https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries
Now when you run the UDF the module will exist on all nodes of the cluster.
Using nltk.download() to get data that the module references: when we do nltk.download() in a multi-node cluster interactively, it will only download to the driver node. So when your UDF executes on the other nodes, those nodes will not contain the needed data in the paths nltk searches by default. To see these default paths, run nltk.data.path.
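A quick way to see the mismatch (a hypothetical check, not part of the original answer, assuming a notebook where spark is defined) is to print the search path on the driver and from inside a trivial UDF on the workers:
import nltk
from pyspark.sql import functions as F

print(nltk.data.path)  # search paths on the driver

# the same list of search paths, as seen by a worker process
worker_paths = F.udf(lambda _: str(__import__('nltk').data.path))
spark.range(1).select(worker_paths('id')).show(truncate=False)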
To overcome this there are two possibilities I have explored; one of them works.
(doesn't work) Using an init script, install nltk, then in that same init script call nltk.download via a one-liner bash/Python expression after the install, like:
python -c "import nltk; nltk.download('all')"
I have run into the issue where nltk is installed but not found after the install. I'm assuming virtual environments are playing a role here.
(works) Using an init script, install nltk.
Create the script
dbutils.fs.put('dbfs:/databricks/scripts/nltk-install.sh', """
#!/bin/bash
pip install nltk""", True)
Check it out
%sh
head '/dbfs/databricks/scripts/nltk-install.sh'
Configure cluster to run init script on start up
(screenshot: Databricks cluster init script config)
In the cluster configuration create the environment variable NLTK_DATA="/dbfs/databricks/nltk_data/". This is used by the nltk package to search for data/model dependencies.
(screenshot: Databricks cluster environment variable config)
Start the cluster.
After it is installed and the cluster is running, check to make sure the environment variable was correctly created.
import os
os.environ.get("NLTK_DATA")
Then check to make sure that nltk is pointing towards the correct paths.
import nltk
nltk.data.path
If '/dbfs/databricks/nltk_data/' is in the list, we are good to go.
Download the stuff you need.
nltk.download('all', download_dir="/dbfs/databricks/nltk_data/")
Notice that we downloaded the dependencies to Databricks storage, so every node will have access to the default nltk dependencies. Because we specified the environment variable NLTK_DATA at cluster creation time, nltk will look in that directory when imported. The only difference here is that we pointed nltk to our Databricks storage, which is accessible by every node.
And since the data exists in mounted storage at cluster start-up, we shouldn't need to re-download the data every time.
After following these steps you should be all set to play with nltk and all of its default data/models.
I recently encountered the same issue when using NLTK in a Glue job.
Adding the 'missing' file to all nodes resolved the issue for me. I'm not sure if it will help in Databricks, but it is worth a shot.
sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
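If more than one file is needed, sc.addFile can also ship a whole directory recursively, and nltk can be pointed at the shipped copy. A minimal sketch, assuming the data was downloaded to /tmp/nltk_data on the driver:
from pyspark import SparkFiles

# ship the entire nltk_data directory to every executor
sc.addFile('/tmp/nltk_data', recursive=True)

# then, inside the UDF (i.e. on the workers), point nltk at the shipped copy
import nltk
nltk.data.path.append(SparkFiles.get('nltk_data'))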
Drew Ringo's suggestion almost worked for me.
If you're using a multi-node cluster in Databricks, you will face the problems Ringo mentioned. For me a much simpler solution was running the following init script:
dbutils.fs.put("dbfs:/databricks/scripts/nltk_punkt.sh", """#!/bin/bash
pip install nltk
python -m nltk.downloader punkt""", True)
Make sure to add the filepath under Advanced options -> Init Scripts found within the Cluster Configuration menu.
The first of Drew Ringo's two possibilities will work if your cluster's init script looks like this:
%sh
/databricks/python/bin/pip install nltk
/databricks/python/bin/python -m nltk.downloader punkt
He is correct to assume that his original issue relates to virtual environments.
This helped me to solve the issue:
import nltk
nltk.download('all')

Tensorflow installed can't be imported

I'm trying to import tensorflow, but even after installing it, it doesn't seem to be recognized.
>conda create -n tf tensorflow
>conda activate tf
(tf)>pip install --ignore-installed --upgrade tensorflow==1.15 --user
...
Successfully installed absl-py-0.9.0 astor-0.8.1 gast-0.2.2 google-pasta-0.2.0 grpcio-1.28.1 h5py-2.10.0 keras-applications-1.0.8 keras-preprocessing-1.1.0 markdown-3.2.1 numpy-1.18.2 opt-einsum-3.2.0 protobuf-3.11.3 setuptools-46.1.3 six-1.14.0 tensorboard-1.15.0 tensorflow-1.15.0 tensorflow-estimator-1.15.1 termcolor-1.1.0 werkzeug-1.0.1 wheel-0.34.2 wrapt-1.12.1
(tf) C:\Users\antoi\Documents\Programming\Covent Garden\covent_garden_ds>python3 app.py
Traceback (most recent call last):
File "app.py", line 4, in <module>
from tensorflow.keras.callbacks import ModelCheckpoint
ModuleNotFoundError: No module named 'tensorflow'
Python3 is there:
(tf) C:\Users\antoi\Documents\Programming\Covent Garden\covent_garden_ds>where python3
C:\Users\antoi\AppData\Local\Microsoft\WindowsApps\python3.exe
It's not the one I should be using, is it?
What version of Python are you running? In order to both successfully import and run the TensorFlow module, you must have the 64-bit version of Python installed. If you are using the latest version of Python, which to the best of my knowledge is 3.8.2, completely uninstall that version and downgrade to the latest Python version with 64-bit support.
If you follow the output of pip --version to find where your Anaconda files are located, you can find the Anaconda Python executable, usually about two directory levels higher (if pip is in C:\example\anaconda\lib\site-packages, then python is probably in C:\example\anaconda). Use the full path to that executable to run the file, like C:\example\anaconda\python app.py. Alternatively, update your PATH environment variable to replace C:\Users\antoi\AppData\Local\Microsoft\WindowsApps\ with the directory containing the Anaconda Python executable.
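A quick sanity check (a minimal sketch, not from the original answer) is to make a script report which interpreter is actually running it and where that interpreter finds tensorflow, then compare those paths against what pip reported:
import sys
import importlib.util

# the interpreter actually executing this file; if this is the
# WindowsApps stub rather than the conda environment's python.exe,
# pip and python are out of sync
print(sys.executable)

# where (and whether) this interpreter can see tensorflow
spec = importlib.util.find_spec("tensorflow")
print(spec.origin if spec else "tensorflow is not visible to this interpreter")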
I've had this same exact issue (but on macOS) several times before I realized what was wrong, and I've seen several others have this issue too. I wish there was a way Python could better regulate this to make sure the default executables for pip and python are always in sync.

Using Homebrew python3 with both homebrew packages and pip/pip3 packages in Visual Studio Code for Mac OS

I am currently trying to setup Visual Studio Code on Mac OSX 10.13.6 for coding in python3. I'd like to avoid using multiple virtual environments for my different python3 scripts and instead have them all run using:
(1) the same homebrew installation of python3
(2) accessing installed python packages in:
homebrew packages list
pip3 installed package list
pip installed packages list.
First, I installed python3 using Homebrew:
$ brew info python
python: stable 3.7.7 (bottled), HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org/
/usr/local/Cellar/python/3.7.7 (4,062 files, 62.4MB)
...
Python has been installed as
/usr/local/bin/python3
...
You can install Python packages with
pip3 install <package>
They will install into the site-package directory
/usr/local/lib/python3.7/site-packages
Second, I installed my required packages using homebrew:
$ brew list
cmake libffi p11-kit
dcraw libheif pandoc
dlib libidn2 pcre
...
jasper numpy webp
...
And other packages using pip and pip3:
$ pip list
DEPRECATION:...
Package Version
-------------------------------------- --------
altgraph 0.10.2
...
numpy 1.8.0rc1
...
zope.interface 4.1.1
$
$ pip3 list
Package Version
------------------ -------
appnope 0.1.0
...
numpy 1.18.2
pandocfilters 1.4.2
parso 0.5.2
pexpect 4.7.0
pickleshare 0.7.5
pip 20.0.2
pomegranate 0.12.2
...
scipy 1.4.1
Third, I opened Visual Studio Code, went to "Preferences" -> "Settings", and set "Python: Python Path" to the Homebrew python3 installation noted above, /usr/local/bin/python3.
See this screenshot:
Next, I added /usr/local/lib/python3.7/site-packages per the Homebrew install of python3 to the Visual Studio Code settings file using:
"python.autoComplete.extraPaths": [
    "/usr/local/lib/python3.7/site-packages"
]
Finally, I selected my python interpreter in Visual Studio Code as /usr/local/bin/python3 and tried to run the following 2-lines of imports in a .py script as per the screenshot below. Note that the interpreter is Python 3.7.0 64-bit given by the bottom left corner of the VS Code window.
And after all of that, I ended up with this output after clicking the "Play" button to run the code in the top right corner of VS Code:
[Running] python -u "/Users/...bayes_net_nodes.py"
Traceback (most recent call last):
File "/Users/...bayes_net_nodes.py", line 1, in <module>
import numpy as np
ModuleNotFoundError: No module named 'numpy'
[Done] exited with code=1 in 0.037 seconds
What would be the most simple way to configure VS Code so I can run python3 scripts that have access to the all the packages I've installed across my system without using virtual environments? Thank you!
Note: one workaround that seems to work, and I'm not sure why, is putting a shebang at the top of my script, #! /usr/local/bin/python3. My output then looks like this:
[Running] /usr/local/bin/python3 "/Users/...bayes_net_nodes.py"
[Done] exited with code=0 in 0.051 seconds
This is odd, because that output differs from the one above without the shebang, even though according to VS Code both runs use the same interpreter, /usr/local/bin/python3.
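One way to see which interpreter actually executes the script (a small diagnostic suggestion, not part of the original question) is to have the script print it:
import sys

# compare this path against the interpreter selected in VS Code's status bar
print(sys.executable)
If this prints something other than /usr/local/bin/python3 when launched with the Play button, the runner is not using the interpreter selected in VS Code, which is exactly what the answer below finds.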
I was able to reproduce your problem, but only when I use Code Runner to run.
That kind of output log with [Running] and [Done] is Code Runner's.
The play button also not being green points to Code Runner, because the default one is green.
Now, for the fix!
You'll notice that it executed your script using python -u. That python would be whatever python means on your system, which for me is the default Python 2.7. Basically, it's not your Homebrew Python3 with numpy.
Code Runner has a default set of "executors" which tells it which executable to use for which language. Search for it in your settings as "code-runner Executor Map":
Open your settings.json, enter code-runner.executorMap, then let it auto-complete with the default. You'll then see a long list of mappings between language and executor. Look for the one for python:
"code-runner.executorMap": {
"javascript": "node",
...
"python": "python -u",
"perl": "perl",
...
}
And there it is: python -u, the same one it used to run your script.
If you want to continue using Code Runner, simply change it to whichever Python interpreter you want to use. In your case, it should be /usr/local/bin/python3:
"code-runner.executorMap": {
...
"python": "/usr/local/bin/python3",
...
}
It should now work:
The reason it works with a #! /usr/local/bin/python3 shebang is because Code Runner has a setting that it respects the file's shebang (code-runner.respectShebang) which is true by default.
If you don't want this extra step of setting up Code Runner, you can simply disable (or uninstall) it. All the steps you already did (setting python.pythonPath, selecting the interpreter, and clicking that Play button) would have worked just fine with Microsoft's Python extension. See the official docs on running Python files, selecting environments, and debugging.

Pyinstaller does not properly include numpy in Mac bundle

If I include numpy in my script the bundle application does not even open. However, if I run the application from the console everything is fine. So:
pyinstaller -w myScript.spec
with import numpy as np in one of the modules does not create a proper executable. However:
python3.7 myScript.py
runs without problems. Even more: if I comment out the import numpy as np line, the executable is created just fine. I have also used numpy in another console-only script without issues.
So, how can I make PyInstaller include numpy in the bundle app?
I checked the warn-myScript.txt file from PyInstaller and there are lots of modules from numpy.core that are not found, for example: numpy.core.sqrt.
But I have no idea where to find these modules.
I tried doing what j4n7 suggested here, but it did not work.
I am using Python3.7, numpy 1.15.4 and PyInstaller 3.4
I installed Python from the Python web page and numpy and Pyinstaller using pip.
On a different computer I installed Python 3.7 from Homebrew, and I have the same problem.
I installed miniconda and then created an environment with numpy 1.15.4, Pyinstaller 3.4 and python3.7.1. Within the environment, I can create the bundle app with no problem.
However, the bundle app goes to 600MB. I will start a new question regarding how to reduce the size of the bundle app.
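For completeness, another avenue for the "module not found" warnings is declaring the missed modules as hiddenimports in the .spec file. A sketch only; the module names are illustrative and not taken from the question:
# myScript.spec (excerpt) -- hypothetical; adjust names to your project
a = Analysis(
    ['myScript.py'],
    hiddenimports=['numpy', 'numpy.core._methods'],  # modules the analysis missed
)
pyz = PYZ(a.pure)
exe = EXE(pyz, a.scripts, name='myScript')
The same can be done from the command line with pyinstaller --hidden-import numpy myScript.py.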

PyCharm Import Error: Claims 'matplotlib' is not a package, but works successfully in IDLE

Happy October everyone,
I've successfully downloaded modules before using either the PyCharm installer or pip through the command line, but for some reason when installing matplotlib PyCharm cannot recognize it. I've uninstalled and reinstalled, I've installed through both methods, and I've followed past similar questions asked on this site, making sure I have the same interpreter and that it was installed in the right folder (pycharm error while importing, even though it works in the terminal).
So, here's the whole problem. Here is the simple code, submitted to both PyCharm and IDLE:
import matplotlib.pyplot as plt
plt.plot([1,2,3],[2,1,3])
plt.show()
When submitted to IDLE, my plot appears. When submitted to PyCharm, the following error appears:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/PythonProject/matplotlib.py", line 1, in <module>
import matplotlib.pyplot as plt
File "C:\PythonProject\matplotlib.py", line 1, in <module>
import matplotlib.pyplot as plt
ImportError: No module named 'matplotlib.pyplot'; 'matplotlib' is not a package
I am currently running Python 3.4, PyCharm 2016.2.3, and my matplotlib folders are indeed in my site-packages folder inside my Python34 folder. Also for further verification:
PyCharm installation
Please help; I've become frustrated, since this is the only module I've run into trouble with. I've scoured Stack Overflow and related websites, I've made sure I have all the requirements, etc.
I guess you named the Python module you are currently writing matplotlib.py. That causes Python to load your own module instead of the actual matplotlib package, which triggers the error.
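A quick way to confirm the shadowing (a minimal check, not from the original answer):
import matplotlib

# If this prints C:\PythonProject\matplotlib.py, your own file is shadowing
# the real package: rename it, and delete any leftover matplotlib.pyc or
# __pycache__ entries next to it.
print(matplotlib.__file__)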
I recommend using virtualenv. It is not strictly necessary, but it is good for separating your project environments.
This is how I tested matplotlib on my Windows 10 installation, hope it helps.
Be sure that you have the python 3 installation folder listed in your Windows PATH environment variable, should be already listed if you checked "Add Python 3.5 to PATH":
You also need to add the Scripts folder to your PATH environment variable; it should usually be this path:
C:\Users\<your username>\AppData\Local\Programs\Python\Python35\Scripts
If you don't do that, you have to prepend python -m to every command below, like python -m <command>; so the command below would become python -m pip install virtualenv. I prefer the first solution.
To test matplotlib in PyCharm I used virtualenv. Here is how; first, install virtualenv:
pip install virtualenv
Then create your virtual environment in a folder of your choice; in my case I used python_3_env_00:
virtualenv python_3_env_00
After that you can activate your Python 3 virtual environment:
python_3_env_00/Scripts/activate.bat
Now you should see the active virtual environment (python_3_env_00) in your command line, like this:
Now you can install matplotlib:
pip install matplotlib
Fire up PyCharm and add your virtual environment as your project interpreter: go to File -> Settings, search for Project Interpreter, click on the gear icon, choose Add Local, and set the path of your virtual environment. It should look something like this:
Test it:
import sys
print(sys.path)
Run this code where the import worked, and run it in the PyCharm project. Compare the lists, and find the path that is not present in PyCharm's sys.path.
Before importing pyplot, append the missing path to sys.path:
import sys
sys.path.append("the path")
import matplotlib.pyplot as plt
Does this work?
Please follow below steps if you are still getting an error:
If you are using PyCharm, it automatically creates a virtualenv.
Ensure the Scripts path is set in PATH:
C:\Users\<Username>\AppData\Local\Programs\Python\Python37-32
Then open PyCharm and go to File -> Settings. Search for Project Interpreter. You will see a window like this:
sample image
Click on the settings icon -> Existing Environment -> click on "..." and give the path below:
C:\Users\Krunal\AppData\Local\Programs\Python\Python37-32\python.exe
Click on Apply -> OK and you are good to go.
After installing matplotlib, when I tried to use matplotlib.pyplot it gave a module-not-found error.
I browsed some write-ups and found that we also need to install the scipy library to use matplotlib, so I used the command below in my command prompt:
python -m pip install scipy
Restarted my kernel session.
It worked!!!
I was also facing an issue while importing matplotlib, but it got resolved, and now I am able to use it from PyCharm as well.
Please make sure you have Visual C++ 14 installed on your system.
If you have more than one Python version installed on your system, install matplotlib for each version, e.g.:
pip install matplotlib
pip3 install matplotlib
If matplotlib works from Python IDLE, check whether you are using the correct interpreter in PyCharm, and try choosing the pythonw.exe path from your installed location.
Hope this helps; please do let me know if you are still facing issues.
I had a similar issue, but I solved it very easily on PyCharm 2019.3.2. In case anyone is looking for an easier solution:
I just opened the terminal window in PyCharm and typed pip install matplotlib, and it was all good to go. Every project has its own virtual environment, and opening the IDE's terminal window cds to the project directory by default, so the install command was enough.
