Integrate PySpark with Jupyter Notebook - apache-spark

I'm following this site to install Jupyter Notebook and PySpark and to integrate the two.
When I needed to create the "Jupyter profile", I read that "Jupyter profiles" no longer exist, so I continued by executing the following lines:
$ mkdir -p ~/.ipython/kernels/pyspark
$ touch ~/.ipython/kernels/pyspark/kernel.json
I opened kernel.json and wrote the following:
{
  "display_name": "pySpark",
  "language": "python",
  "argv": [
    "/usr/bin/python",
    "-m",
    "IPython.kernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7",
    "PYTHONPATH": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python:/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "pyspark-shell"
  }
}
The Spark paths are correct.
But when I run jupyter console --kernel pyspark I get this output:
MacBook:~ Agus$ jupyter console --kernel pyspark
/usr/bin/python: No module named IPython
Traceback (most recent call last):
File "/usr/local/bin/jupyter-console", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/site-packages/jupyter_core/application.py", line 267, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 595, in launch_instance
app.initialize(argv)
File "<decorator-gen-113>", line 2, in initialize
File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 74, in catch_config_error
return method(app, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 137, in initialize
self.init_shell()
File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 110, in init_shell
client=self.kernel_client,
File "/usr/local/lib/python2.7/site-packages/traitlets/config/configurable.py", line 412, in instance
inst = cls(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 251, in __init__
self.init_kernel_info()
File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 305, in init_kernel_info
raise RuntimeError("Kernel didn't respond to kernel_info_request")
RuntimeError: Kernel didn't respond to kernel_info_request
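The line "/usr/bin/python: No module named IPython" is the key: the kernel.json launches /usr/bin/python, and that interpreter has no IPython installed, so the kernel never starts and the kernel_info_request times out. A stdlib-only sketch for diagnosing this (has_module is an illustrative helper, not part of Jupyter):

```python
import importlib.util
import sys

# Illustrative check: run this under the interpreter named in kernel.json's
# "argv". If it reports False, point "argv" at a Python that actually has
# IPython installed (sys.executable of a working environment, for example).
def has_module(name):
    return importlib.util.find_spec(name) is not None

print(sys.executable, has_module("IPython"))
```

If the check fails for /usr/bin/python, either install IPython into that interpreter or change the first entry of "argv" to an interpreter where the check passes.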

There are many ways to integrate PySpark with Jupyter Notebook.
1. Install Apache Toree:
pip install jupyter
pip install toree
jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark
You can verify the installation with
jupyter kernelspec list
You will get an entry for the Toree PySpark kernel:
apache_toree_pyspark /home/pauli/.local/share/jupyter/kernels/apache_toree_pyspark
Afterwards, if you want, you can install other interpreters like SparkR, Scala, and SQL:
jupyter toree install --interpreters=Scala,SparkR,SQL
2. Add these lines to your ~/.bashrc:
export SPARK_HOME=/path/to/spark-2.2.0
export PATH="$PATH:$SPARK_HOME/bin"
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Then type pyspark in a terminal and it will open a Jupyter Notebook with a SparkContext already initialized.
3. Install pyspark as a plain Python package:
pip install pyspark
Now you can import pyspark like any other Python package.
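With the pip-installed package, a local session can be created straight from any Python script. A minimal sketch, guarded with try/except so it degrades gracefully on a machine where pyspark is absent:

```python
# Minimal sketch: with pip-installed pyspark, a local SparkSession can be
# created directly; no SPARK_HOME or kernel configuration is needed.
try:
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[1]")
             .appName("pip-pyspark-demo")
             .getOrCreate())
    n = spark.range(5).count()  # 5 rows if Spark started successfully
    spark.stop()
except ImportError:
    n = None  # pyspark is not installed in this interpreter
```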

The easiest way is to use findspark. First create an environment variable:
export SPARK_HOME="{full path to Spark}"
And then install findspark:
pip install findspark
Then launch jupyter notebook and the following should work:
import findspark
findspark.init()
import pyspark
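For context, findspark.init() essentially derives Spark's Python paths from SPARK_HOME and prepends them to sys.path before pyspark is imported. A rough stdlib-only sketch of that path construction (the Spark home and helper name are illustrative, not findspark's actual internals):

```python
import glob
import os

# Rough sketch of the entries findspark adds to sys.path: Spark's python/
# directory plus the bundled py4j source zip found under python/lib/.
def spark_python_paths(spark_home):
    py_dir = os.path.join(spark_home, "python")
    py4j_zips = glob.glob(os.path.join(py_dir, "lib", "py4j-*-src.zip"))
    return [py_dir] + py4j_zips

paths = spark_python_paths("/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7")
# findspark prepends these entries to sys.path, after which "import pyspark" works
```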

Related

ModuleNotFoundError: Shipping Python modules to Spark cluster nodes using conda environment

Spark version: 3.2.1
The PySpark script was not able to resolve Python modules in the conda environment.
Any help is really appreciated.
Detailed steps:
Conda env prepared on my laptop:
conda create -y -n myconda-env -c conda-forge -c anaconda python=3.8 conda-pack grpcio
conda activate myconda-env
python -m pip install PyHamcrest
python -m pip install --upgrade google-api-python-client
conda pack -f -o myconda-env.tar.gz
Uploaded myconda-env.tar.gz to my Notebook environment in the cloud
Creating the SparkSession with the conda environment:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setMaster("SPARK_URL")
conf.setAppName("MyApp")
conf.set('spark.submit.deployMode', 'client')
# Conda environment
conf.set('spark.archives', 'myconda-env.tar.gz')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
PySpark script:
# Python script files import modules in conda environment
spark.sparkContext.addPyFile("searchproto.zip")
We get the error below when Python scripts in searchproto.zip try to import modules from the conda environment:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 7) (10.201.37.44 executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
process()
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
serializer.dump_stream(out_iter, outfile)
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/spark/python/pyspark/rdd.py", line 1562, in takeUpToNumLeft
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/util.py", line 74, in wrapper
return f(*args, **kwargs)
File "", line 1, in
File "", line 8, in rdd_func
File "", line 259, in load_module
File "./searchproto.zip/searchproto/StudentEvents_pb2.py", line 5, in
from google.protobuf import descriptor as _descriptor
ModuleNotFoundError: No module named 'google'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:555)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:713)
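For reference, Spark's documented conda-pack workflow differs from the configuration above in two ways: the archive is given a "#alias" fragment so it is unpacked under a known directory name on each executor, and the executors' Python is pointed inside that unpacked environment. A hedged sketch of those settings (the alias "environment" is illustrative; this mirrors Spark's Python package management docs, not a verified fix for this exact cluster):

```python
# Hedged sketch, per Spark's documented conda-pack workflow: without the
# "#alias" fragment and spark.pyspark.python, executors keep using their own
# system Python, which has no google.protobuf, hence ModuleNotFoundError.
conda_env_settings = {
    # unpacked on each executor under the directory name "environment"
    "spark.archives": "myconda-env.tar.gz#environment",
    # executors run the Python (and site-packages) shipped in the archive
    "spark.pyspark.python": "./environment/bin/python",
}
# applied as: for key, value in conda_env_settings.items(): conf.set(key, value)
```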

How to solve "jupyter_client.kernelspec.NoSuchKernel: No such kernel named iqsharp" in python?

I was trying to print "hello world" in Microsoft's Q#.
When I run the code, it shows the error below.
Even though I installed the package, I don't know where the problem is.
I think it's a permissions issue in Jupyter Notebook or something similar.
Thanks for your help.
Here is my simple code
import qsharp
from HelloWorld import SayHello
SayHello.simulate()
Here is my output
File "d:\Program Files\Quantum Projects\hello_quantum.py", line 1, in <module>
import qsharp
File "C:\Users\ELCOT\AppData\Local\Programs\Python\Python36\lib\site-packages\qsharp\__init__.py", line 119, in <module>
client = _start_client()
File "C:\Users\ELCOT\AppData\Local\Programs\Python\Python36\lib\site-packages\qsharp\clients\__init__.py", line 28, in _start_client
client.start()
File "C:\Users\ELCOT\AppData\Local\Programs\Python\Python36\lib\site-packages\qsharp\clients\iqsharp.py", line 75, in start
self.kernel_manager.start_kernel(extra_arguments=["--user-agent", f"qsharp.py{user_agent_extra}"])
File "C:\Users\ELCOT\AppData\Local\Programs\Python\Python36\lib\site-packages\jupyter_client\manager.py", line 246, in start_kernel
kernel_cmd = self.format_kernel_cmd(extra_arguments=extra_arguments)
File "C:\Users\ELCOT\AppData\Local\Programs\Python\Python36\lib\site-packages\jupyter_client\manager.py", line 170, in format_kernel_cmd
cmd = self.kernel_spec.argv + extra_arguments
File "C:\Users\ELCOT\AppData\Local\Programs\Python\Python36\lib\site-packages\jupyter_client\manager.py", line 82, in kernel_spec
self._kernel_spec = self.kernel_spec_manager.get_kernel_spec(self.kernel_name)
File "C:\Users\ELCOT\AppData\Local\Programs\Python\Python36\lib\site-packages\jupyter_client\kernelspec.py", line 236, in get_kernel_spec
raise NoSuchKernel(kernel_name)
jupyter_client.kernelspec.NoSuchKernel: No such kernel named iqsharp
I was running into the same issue when setting my conda environment from a script:
#!/bin/sh
. $CONDA_PREFIX/etc/profile.d/conda.sh
# assumes you followed this doc and have run:
# conda create -n qsharp-env -c quantum-engineering qsharp notebook
conda activate qsharp-env
I solved it by installing IQ# as instructed by the documentation (slightly different from Ryan Shaffer's comment):
dotnet tool install -g Microsoft.Quantum.IQSharp
dotnet iqsharp install
I was then able to run python -c "import qsharp" and more complex Python programs.
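As a sanity check, NoSuchKernel simply means jupyter_client found no <name>/kernel.json in any kernelspec directory. A stdlib-only sketch of that lookup (the directory list is illustrative and covers common user-level locations only, not Jupyter's full search path):

```python
import os

# Illustrative sketch of the lookup that fails with NoSuchKernel: Jupyter
# searches its kernelspec directories for <name>/kernel.json.
def kernel_installed(name, kernel_dirs):
    return any(
        os.path.isfile(os.path.join(d, name, "kernel.json"))
        for d in kernel_dirs
    )

user_dirs = [
    os.path.expanduser("~/.local/share/jupyter/kernels"),  # typical Linux path
    os.path.expandvars(r"%APPDATA%\jupyter\kernels"),      # typical Windows path
]
# "dotnet iqsharp install" is what creates the "iqsharp" entry in such a directory
```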

Docker, Python, and "import logging"

I have been trying to build a Docker image/container based on Python 3.8.2 that runs a specific script when started. I can get it to build, but when I try to run it, it breaks.
Specifically, I am running into a problem where:
...
import grpc
...
gives the error:
Traceback (most recent call last):
File "/project1/script1.py", line 4, in <module>
import grpc
File "/usr/local/lib/python3.8/site-packages/grpc/__init__.py", line 23, in <module>
from grpc._cython import cygrpc as _cygrpc
File "src/python/grpcio/grpc/_cython/cygrpc.pyx", line 27, in init grpc._cython.cygrpc
File "/usr/local/lib/python3.8/asyncio/__init__.py", line 8, in <module>
from .base_events import *
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 18, in <module>
import concurrent.futures
File "/usr/local/lib/python3.8/concurrent/futures/__init__.py", line 8, in <module>
from concurrent.futures._base import (FIRST_COMPLETED,
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 42, in <module>
LOGGER = logging.getLogger("concurrent.futures")
AttributeError: module 'logging' has no attribute 'getLogger'
HOWEVER,
if I use
docker run -it <image name> sh
to start the image and then try to run pip at the container's command line, I get the same error.
Here is the Dockerfile used to create the image:
FROM python:3.8.2
#Build the Python Environments
COPY requirements.txt /tmp/
RUN pip install -r /tmp/requirements.txt
RUN pip install "obspy==1.2.1"
#COPY project1
COPY project1 /
#Update PYTHON PATH to include project1
ENV PYTHONPATH "${PYTHONPATH}:/project1"
#Run the script1.py script
CMD python /project1/script1.py
All of this runs fine on my Windows box in a Python 3.8.2 environment that contains all the packages in requirements.txt.
Does anyone have any suggestions on what is going on?
Maybe try a different Python version for the base image.
Try pinning the specific version of each dependency in the requirements.txt file.
Is there a Python file named logging.py or a module folder named logging in your project? If so, try giving it another name.
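The shadowing suggestion in that last point can be checked directly from inside the container: importing logging and inspecting it shows whether the stdlib module or a local file won the import. A small sketch:

```python
import logging

# If a local logging.py shadows the stdlib, logging.__file__ points into the
# project directory instead of the interpreter's lib/ tree, and attributes
# like getLogger are missing (exactly the AttributeError in the traceback).
print(logging.__file__)                 # where the import actually resolved
print(hasattr(logging, "getLogger"))    # False when a stub file shadows it
```

Because /project1 is on PYTHONPATH and project1 is copied to /, any logging.py in that tree would shadow the standard library for every import in the container, including pip's own.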

Setting up Hydrogen and Atom with Anaconda managing python installation

I have added my Python 3 executable to the system PATH (against the advice of Anaconda) to try and get Hydrogen (and really any Atom extension/plugin) to run lines or blocks of code in Atom. The 'script' Atom plugin appears to work (I select some code and press ctrl-shift-b), but I'd love to use more of the features in Hydrogen. When I execute (for example):
print('hello world')
I get the following error:
Python 3
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py", line 15, in <module>
from ipykernel import kernelapp as app
File "C:\ProgramData\Anaconda3\lib\site-packages\ipykernel\__init__.py", line 2, in <module>
from .connect import *
File "C:\ProgramData\Anaconda3\lib\site-packages\ipykernel\connect.py", line 18, in <module>
import jupyter_client
File "C:\ProgramData\Anaconda3\lib\site-packages\jupyter_client\__init__.py", line 4, in <module>
from .connect import *
File "C:\ProgramData\Anaconda3\lib\site-packages\jupyter_client\connect.py", line 23, in <module>
import zmq
File "C:\ProgramData\Anaconda3\lib\site-packages\zmq\__init__.py", line 47, in <module>
from zmq import backend
File "C:\ProgramData\Anaconda3\lib\site-packages\zmq\backend\__init__.py", line 40, in <module>
reraise(*exc_info)
File "C:\ProgramData\Anaconda3\lib\site-packages\zmq\utils\sixcerpt.py", line 34, in reraise
raise value
File "C:\ProgramData\Anaconda3\lib\site-packages\zmq\backend\__init__.py", line 27, in <module>
_ns = select_backend(first)
File "C:\ProgramData\Anaconda3\lib\site-packages\zmq\backend\select.py", line 27, in select_backend
mod = __import__(name, fromlist=public_api)
File "C:\ProgramData\Anaconda3\lib\site-packages\zmq\backend\cython\__init__.py", line 6, in <module>
from . import (constants, error, message, context,
ImportError: DLL load failed: The specified module could not be found.
I have reinstalled the package, and I've tried using both the Anaconda PowerShell and the normal prompt to install and load Atom. My only guess is that it's having trouble launching a kernel, or am I supposed to launch one and then connect?
End goal:
run a code block and have it work.
To use a Conda env as a kernel in Hydrogen you must register the env using ipykernel, e.g.,
conda activate myenv
python -m ipykernel install --user
This creates an entry for the kernel in a default user-level location that is generically visible to any Jupyter instances run by the user (such as Hydrogen). It is recommended to also include a --name NAME flag to distinguish your different envs. Please refer to python -m ipykernel install -h for more options.
Also, note that the minimum requirement for using a Conda env as a kernel is to install ipykernel. And, yeah, clean up the PATH so that it conforms to Conda best practices: there should be no need to edit it manually.

jupyter notebook / Failed to start the kernel due the KeyError

My Jupyter Notebook doesn't start due to a dead kernel, with the following kernel error:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tornado/web.py", line 1512, in _execute
result = yield result
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
yielded = self.gen.send(value)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/notebook/services/sessions/handlers.py", line 67, in post
model = yield gen.maybe_future(sm.get_session(path=path))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/notebook/services/sessions/sessionmanager.py", line 170, in get_session
return self.row_to_model(row)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/notebook/services/sessions/sessionmanager.py", line 209, in row_to_model
raise KeyError
KeyError
In my case, the issue originated from a prompt-toolkit requirement conflict between jupyter-console and ipython.
You can use pip check to see whether you have the same problem. If the output is similar to mine below, you have to fix the broken-packages issue.
>>> pip check ipython
ipython 5.0.0 has requirement prompt-toolkit<2.0.0,>=1.0.3, but you'll have prompt-toolkit 2.0.9 which is incompatible.
>>> pip check jupyter-console
jupyter-console 6.0.0 has requirement prompt-toolkit<2.1.0,>=2.0.0, but you'll have prompt-toolkit 1.0.15 which is incompatible.
The quick fix is to try the solution originally mentioned here.
pip uninstall prompt-toolkit
pip install prompt-toolkit==1.0.15
pip uninstall jupyter-console
pip install jupyter-console==5.2.0
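The pins work because 1.0.15 sits inside both constraint windows once jupyter-console is downgraded. An illustrative stdlib-only sketch of the range check behind pip's complaint (parse_version is a simplified helper, not pip's real specifier logic; it ignores pre-releases and epochs):

```python
# Simplified version-range check mirroring "prompt-toolkit<2.0.0,>=1.0.3".
def parse_version(version):
    return tuple(int(part) for part in version.split("."))

def in_range(installed, at_least, below):
    return parse_version(at_least) <= parse_version(installed) < parse_version(below)

# ipython 5.0.0 wants prompt-toolkit >=1.0.3,<2.0.0:
print(in_range("1.0.15", "1.0.3", "2.0.0"))  # the pinned version fits: True
print(in_range("2.0.9", "1.0.3", "2.0.0"))   # the conflicting version: False
```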
