Spark version: 3.2.1
My PySpark script is not able to resolve Python modules from the conda environment.
Any help is really appreciated.
Detailed steps:
Conda environment prepared on my laptop:
conda create -y -n myconda-env -c conda-forge -c anaconda python=3.8 conda-pack grpcio
conda activate myconda-env
python -m pip install PyHamcrest
python -m pip install --upgrade google-api-python-client
conda pack -f -o myconda-env.tar.gz
Uploaded myconda-env.tar.gz to my Notebook environment in the cloud
Created the SparkSession with the conda environment:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setMaster("SPARK_URL")
conf.setAppName("MyApp")
conf.set('spark.submit.deployMode', 'client')
# Conda environment
conf.set('spark.archives', 'myconda-env.tar.gz')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
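For reference, the Spark 3.x documentation's conda-pack recipe attaches an alias to the archive and points PYSPARK_PYTHON at the unpacked environment; a minimal sketch of that documented pattern (the alias name "environment" is only an example):

import os
from pyspark.sql import SparkSession

# The '#environment' fragment sets the directory name the archive is unpacked into on each node,
# and PYSPARK_PYTHON makes the Python workers use that environment's interpreter.
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder \
    .config("spark.archives", "myconda-env.tar.gz#environment") \
    .getOrCreate()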
PySpark script:
# Python files in searchproto.zip import modules from the conda environment
spark.sparkContext.addPyFile("searchproto.zip")
We get the error below when Python scripts in searchproto.zip try to import modules from the conda environment:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 7) (10.201.37.44 executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
process()
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
serializer.dump_stream(out_iter, outfile)
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/spark/python/pyspark/rdd.py", line 1562, in takeUpToNumLeft
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/util.py", line 74, in wrapper
return f(*args, **kwargs)
File "", line 1, in
File "", line 8, in rdd_func
File "", line 259, in load_module
File "./searchproto.zip/searchproto/StudentEvents_pb2.py", line 5, in
from google.protobuf import descriptor as _descriptor
ModuleNotFoundError: No module named 'google'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:555)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:713
Related
Currently I'm connecting to Databricks from local VS Code via databricks-connect, but my submissions all fail with a module-not-found error, which means the code in my other Python files is not found.
I tried:
Moving the code into the folder with main.py
Importing the file inside the function that uses it
Adding the file via sparkContext.addPyFile
Does anyone have experience with this, or know a better way to work with Databricks for Python projects?
It seems my Python code is executed in the local Python environment and only the Spark-related code runs on the cluster, but the cluster does not load all my Python files, which raises the error.
I have a folder with these files:
main.py
lib222.py
__init__.py
with a class Foo in lib222.py.
The main code is:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
#sc.setLogLevel("INFO")
print("Testing addPyFile isolation")
sc.addPyFile("lib222.py")
from lib222 import Foo
print(sc.parallelize(range(10)).map(lambda i: Foo(2)).collect())
But I get a ModuleNotFoundError for lib222.
Also, when I print the Python version and some sys info, it seems the Python code is executed on my local machine instead of the remote driver.
My Databricks Runtime version is 6.6.
Detailed Error:
> Exception has occurred: Py4JJavaError
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 10.139.64.8, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'lib222'
>
>During handling of the above exception, another exception occurred:
>
>Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 462, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/databricks/spark/python/pyspark/worker.py", line 71, in read_command
command = serializer._read_with_length(file)
File "/databricks/spark/python/pyspark/serializers.py", line 185, in _read_with_length
raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'lib222'
I use Databricks on AWS, and the best practices I follow are as follows:
Uninstall PySpark from your local environment using pip or conda
Create a virtual environment on your local system with a Python version compatible with your Databricks runtime. Having a virtual environment gives you more control over your setup and avoids version conflicts.
conda create -n ENV_NAME python==PYTHON_VERSION
The minor version of your client Python installation must be the same as the minor Python version of your Databricks cluster (3.5, 3.6, or 3.7). Databricks Runtime 5.x has Python 3.5, Databricks Runtime 5.x ML has Python 3.6, and Databricks Runtime 6.1 and above and Databricks Runtime 6.1 ML and above have Python 3.7.
Note: Always use pip to install PySpark, as it points to the official release. Avoid conda or conda-forge for the PySpark installation.
Follow the steps in the official databricks-connect documentation for configuring the workspace.
On your Databricks cluster, check the existing versions of PySpark and its dependencies. If I am correct, the dependency versions for the latest PySpark are as follows:
pandas 0.23.2
NumPy 1.7
pyarrow 0.15.1
Py4J 0.10.9
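Once databricks-connect is configured, a quick sanity check that work actually reaches the cluster is a tiny job; a minimal sketch:

from pyspark.sql import SparkSession

# With databricks-connect configured, getOrCreate() returns a session backed by the remote cluster.
spark = SparkSession.builder.getOrCreate()

# A trivial job that runs on the cluster's executors; if this works, connectivity is fine and any
# remaining ModuleNotFoundError is about shipping your own .py files (e.g. via sc.addPyFile).
print(spark.range(10).selectExpr("sum(id) AS total").collect())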
After running
python -m nuitka --plugin-enable=pylint-warnings --follow-imports --standalone sample.py
it completes the build without any error, but when I run the built sample binary from the sample.dist directory it gives:
Traceback (most recent call last):
File "[PATH TO PROJECT]/sample.dist/pkg_resources/__init__.py", line 359, in get_provider
KeyError: 'pyfiglet.fonts'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "[PATH TO PROJECT]/sample.dist/yarasilly2.py", line 183, in <module>
File "[PATH TO PROJECT]/sample.dist/pyfiglet/__init__.py", line 794, in __init__
File "[PATH TO PROJECT]/sample.dist/pyfiglet/__init__.py", line 801, in setFont
File "[PATH TO PROJECT]/sample.dist/pyfiglet/__init__.py", line 126, in __init__
File "[PATH TO PROJECT]/sample.dist/pyfiglet/__init__.py", line 136, in preloadFont
File "[PATH TO PROJECT]/sample.dist/pkg_resources/__init__.py", line 1134, in resource_exists
File "[PATH TO PROJECT]/sample.dist/pkg_resources/__init__.py", line 361, in get_provider
ModuleNotFoundError: No module named 'pyfiglet.fonts'
Nuitka version, full Python version and Platform (Windows, OSX, Linux ...)
python -m nuitka --version
0.6.8.4
Python: 3.8.3 (default, May 29 2020, 00:00:00)
Executable: [PATH TO PROJECT]/venv/bin/python
OS: Linux
Arch: x86_64
Nuitka Install
pip install nuitka
Sample piece of code:
from clint.textui import puts, colored  # assumed source of puts/colored, based on the usage below
from pyfiglet import Figlet

if __name__ == '__main__':
    f = Figlet(font='slant')
    puts(colored.blue(f.renderText("Sample Text")))
Try the code below in the Jupyter notebook:
! pip install pyfiglet
Run it before you start your work, and let me know if that works.
I met the same problem, and after reading the user guide I found a solution:
python -m nuitka --plugin-enable=pylint-warnings --follow-imports --standalone sample.py --include-package=pyfiglet --include-data-dir={PATH-TO-PYFIGLET-LIB}=pyfiglet
My Jupyter notebook doesn't start due to a dead kernel, with the following kernel error:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tornado/web.py", line 1512, in _execute
result = yield result
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
yielded = self.gen.send(value)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/notebook/services/sessions/handlers.py", line 67, in post
model = yield gen.maybe_future(sm.get_session(path=path))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/notebook/services/sessions/sessionmanager.py", line 170, in get_session
return self.row_to_model(row)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/notebook/services/sessions/sessionmanager.py", line 209, in row_to_model
raise KeyError
KeyError
In my case, the issue originated from a prompt-toolkit requirement conflict between jupyter-console and ipython.
You can use pip check to see whether you have the same problem. If the output is similar to mine below, you have to fix the broken packages.
>>> pip check ipython
ipython 5.0.0 has requirement prompt-toolkit<2.0.0,>=1.0.3, but you'll have prompt-toolkit 2.0.9 which is incompatible.
>>> pip check jupyter-console
jupyter-console 6.0.0 has requirement prompt-toolkit<2.1.0,>=2.0.0, but you'll have prompt-toolkit 1.0.15 which is incompatible.
The quick fix is to try the solution originally mentioned here.
pip uninstall prompt-toolkit
pip install prompt-toolkit==1.0.15
pip uninstall jupyter-console
pip install jupyter-console==5.2.0
I already have a virtual environment installation of TensorFlow on my computer, although it runs Python 2.7. Now I want to work with TensorFlow running on Python 3.5.
For Python 3, I've created a virtual environment, since the default Python environment on my computer is Python 2.7. I'm attempting a pip installation of TensorFlow in a Python 3 virtual environment that I have named py3k. The installation procedure throws errors that I'm having difficulty debugging.
Here's what I did:
anirudh@anirudh-Vostro-3445:~$ source activate py3k
(py3k) anirudh@anirudh-Vostro-3445:~$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl
(py3k) anirudh@anirudh-Vostro-3445:~$ sudo pip3 install --upgrade $TF_BINARY_URL
/usr/local/lib/python3.2/dist-packages/pip-8.1.2-py3.2.egg/pip/_vendor/pkg_resources/__init__.py:80: UserWarning: Support for Python 3.0-3.2 has been dropped. Future versions will fail here.
warnings.warn(msg)
Traceback (most recent call last):
File "/usr/local/bin/pip3", line 9, in <module>
load_entry_point('pip==8.1.2', 'console_scripts', 'pip3')()
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 337, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 2280, in load_entry_point
return ep.load()
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 1990, in load
entry = __import__(self.module_name, globals(),globals(), ['__name__'])
File "/usr/local/lib/python3.2/dist-packages/pip-8.1.2-py3.2.egg/pip/__init__.py", line 16, in <module>
from pip.vcs import git, mercurial, subversion, bazaar # noqa
File "/usr/local/lib/python3.2/dist-packages/pip-8.1.2-py3.2.egg/pip/vcs/mercurial.py", line 9, in <module>
from pip.download import path_to_url
File "/usr/local/lib/python3.2/dist-packages/pip-8.1.2-py3.2.egg/pip/download.py", line 36, in <module>
from pip.utils.ui import DownloadProgressBar, DownloadProgressSpinner
File "/usr/local/lib/python3.2/dist-packages/pip-8.1.2-py3.2.egg/pip/utils/ui.py", line 15, in <module>
from pip._vendor.progress.bar import Bar, IncrementalBar
File "/usr/local/lib/python3.2/dist-packages/pip-8.1.2-py3.2.egg/pip/_vendor/progress/bar.py", line 48
empty_fill = u'∙'
^
SyntaxError: invalid syntax
cp35 in https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl means CPython 3.5.
You have Python 3.2 on your machine (the traceback shows pip running from /usr/local/lib/python3.2/dist-packages).
I recommend installing Ubuntu 15.04.
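Before pointing pip at a version-tagged wheel, it is worth confirming which interpreter that pip actually runs under; a quick standard-library check:

import sys

# The cp35 wheel requires Python 3.5; the traceback above shows pip itself running under Python 3.2.
print(sys.executable)
print(sys.version_info)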
I'm following this site to install Jupyter Notebook, PySpark and integrate both.
When I needed to create the "Jupyter profile", I read that "Jupyter profiles" no longer exist, so I continued by executing the following lines:
$ mkdir -p ~/.ipython/kernels/pyspark
$ touch ~/.ipython/kernels/pyspark/kernel.json
I opened kernel.json and wrote the following:
{
  "display_name": "pySpark",
  "language": "python",
  "argv": [
    "/usr/bin/python",
    "-m",
    "IPython.kernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7",
    "PYTHONPATH": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python:/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "pyspark-shell"
  }
}
The paths of Spark are correct.
But then, when I run jupyter console --kernel pyspark I get this output:
MacBook:~ Agus$ jupyter console --kernel pyspark
/usr/bin/python: No module named IPython
Traceback (most recent call last):
File "/usr/local/bin/jupyter-console", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/site-packages/jupyter_core/application.py", line 267, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 595, in launch_instance
app.initialize(argv)
File "<decorator-gen-113>", line 2, in initialize
File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 74, in catch_config_error
return method(app, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 137, in initialize
self.init_shell()
File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 110, in init_shell
client=self.kernel_client,
File "/usr/local/lib/python2.7/site-packages/traitlets/config/configurable.py", line 412, in instance
inst = cls(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 251, in __init__
self.init_kernel_info()
File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 305, in init_kernel_info
raise RuntimeError("Kernel didn't respond to kernel_info_request")
RuntimeError: Kernel didn't respond to kernel_info_request
There are many ways to integrate PySpark with Jupyter notebook.
1. Install Apache Toree:
pip install jupyter
pip install toree
jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark
You can check the installation with:
jupyter kernelspec list
You will get an entry for the Toree PySpark kernel:
apache_toree_pyspark /home/pauli/.local/share/jupyter/kernels/apache_toree_pyspark
Afterwards, if you want, you can install other interpreters like SparkR, Scala, and SQL:
jupyter toree install --interpreters=Scala,SparkR,SQL
2. Add these lines to your .bashrc:
export SPARK_HOME=/path/to/spark-2.2.0
export PATH="$PATH:$SPARK_HOME/bin"
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Type pyspark in a terminal and it will open a Jupyter notebook with the SparkContext initialized.
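In a notebook launched this way, pyspark's shell startup script has already defined spark and sc, so a first cell can use them directly, for example:

# `spark` (SparkSession) and `sc` (SparkContext) are created by pyspark/shell.py at startup.
print(sc.version)
spark.range(5).show()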
3. Install pyspark only as a Python package:
pip install pyspark
Now you can import pyspark like any other Python package.
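For example, a minimal local session with the pip-installed package (local[*] is just an illustrative master):

from pyspark.sql import SparkSession

# Runs entirely on the local machine using the pip-installed pyspark distribution.
spark = SparkSession.builder.master("local[*]").appName("pip-pyspark-check").getOrCreate()
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"]).show()
spark.stop()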
4. The easiest way is to use findspark. First create an environment variable:
export SPARK_HOME="{full path to Spark}"
And then install findspark:
pip install findspark
Then launch jupyter notebook and the following should work:
import findspark
findspark.init()
import pyspark
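From there a context can be created as usual; a short continuation sketch (the app name is arbitrary):

# Continuing after findspark.init(): create, use, and stop a throwaway context.
sc = pyspark.SparkContext(appName="findspark-check")
print(sc.parallelize(range(10)).sum())
sc.stop()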