I have a running Docker instance with a Zeppelin/Spark environment configured. I would like to know whether it is possible to start a spark-shell process, for example with
docker exec -it CONTAINER spark-shell
and then connect a Zeppelin notebook to the Spark context it creates. I've seen that there is a "Connect to existing process" checkbox on the Zeppelin interpreters page, but I'm not sure how to use it.
For example, in Jupyter it is possible to attach an IPython console to an existing notebook kernel so that the console has access to the notebook's variables. I'm wondering whether there is similar functionality for Zeppelin and spark-shell, or whether this is not possible for some reason.
Thank you in advance.
KK
I am following this guide to run a Zeppelin container in a local Kubernetes cluster set up with Minikube:
https://zeppelin.apache.org/docs/0.9.0-SNAPSHOT/quickstart/kubernetes.html
I am able to set up Zeppelin and run some sample code there. I have downloaded the Spark 2.4.5 and 2.4.0 source code and built it with Kubernetes support using the following command:
./build/mvn -Pkubernetes -DskipTests clean package
Once Spark was built, I created a Docker image as explained in the article:
bin/docker-image-tool.sh -m -t 2.4.X build
I configured Zeppelin to use the Spark image that was built with Kubernetes support. The article above explains that the Spark interpreter will auto-configure Spark on Kubernetes to run in client mode and run the job.
But whenever I try to run any paragraph with Spark, I receive the following error:
Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: ?/.ivy2/local
I tried setting the Spark configuration spark.jars.ivy in Zeppelin to point to a temp directory, but that does not work either.
I found a similar issue here:
basedir must be absolute: ?/.ivy2/local
But I can't seem to get Spark to run with spark.jars.ivy set to /tmp/.ivy. I also tried baking the setting into spark-defaults.conf when building Spark, but that does not seem to work either.
I'm quite stumped by this problem; any guidance on how to solve it would be appreciated.
Thanks!
I have also run into this problem. The workaround I used for setting spark.jars.ivy=/tmp/.ivy is to set it as an environment variable instead.
In your Spark interpreter settings, add a property named SPARK_SUBMIT_OPTIONS and set its value to --conf spark.jars.ivy=/tmp/.ivy.
This should pass the additional option through to spark-submit, and your job should then run.
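If you prefer to keep this out of the interpreter UI, the same setting can go into Zeppelin's environment file; a minimal sketch, assuming a stock Zeppelin layout where conf/zeppelin-env.sh is sourced at startup:
# conf/zeppelin-env.sh - applied to every spark-submit that Zeppelin issues
export SPARK_SUBMIT_OPTIONS="--conf spark.jars.ivy=/tmp/.ivy"
After that, restart the Zeppelin server (or at least the Spark interpreter) so the option is picked up.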
I have been struggling with this issue for four days. I've looked at several web pages dealing with the same problem, including here on Stack Overflow, but without finding a solution.
I installed Spark 2.3.0, Scala 2.12.5 and Hadoop 2.7.1 (for winutils), then set up the corresponding environment variables. I installed findspark and then launched pyspark in my Jupyter notebook. The issue is that when I run:
sc = pyspark.SparkContext('local')
I get the following error:
java gateway process exited before sending the driver its port number
I should mention that I'm using Java 1.8.0 and that I have set the following in my environment variables:
PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
If you have any idea how I can solve this issue, I would be grateful. Thank you!
The setup is fairly simple and straightforward. Below are steps that you can follow.
Assumed:
You have downloaded Spark and extracted its archive into <spark_home>, and added the <spark_home>/bin directory to your PATH variable
You have installed Jupyter and it can be launched with jupyter notebook from the command line
Steps to be followed:
Export these two variables; this is best done in your user profile script:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
To open Jupyter, all you have to do is run
pyspark
If you have additional options, such as master, you can pass them to pyspark:
pyspark --master local[2]
When the notebook opens, the Spark context is already initialized (as sc), and so is the Spark session (as spark).
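If you would rather not touch your profile, the same effect can be had with a one-off invocation; a minimal sketch that just combines the variables and options from the steps above (local[2] is only an example master):
# one-off launch: Jupyter acts as the PySpark driver, nothing is exported permanently
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' pyspark --master local[2]
Inside the notebook that opens, sc and spark are the objects pyspark created for you.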
Here are the steps I have taken so far:
I installed Anaconda3, with everything included, under the directory $HOME/anaconda3/bin.
I cd'ed into $HOME/anaconda3/bin and ran the command ./conda install -c conda-forge pyspark. It was successful.
I didn't do anything else. More specifically, there are no variables set in my .bashrc.
Here are some important details:
I am on a distributed cluster running Hadoop, so there might be other directories outside my home folder that I have yet to discover but might need. I also don't have admin access.
Jupyter notebook runs just fine.
Here is my goal:
To set some variables or configure some files so that I can run pyspark in a Jupyter notebook.
What other steps do I need to take after step 3 in order to achieve this goal?
Since you have installed pyspark with conda, and as you say Jupyter notebook runs fine (presumably for the same Anaconda distribution), there are no further steps required - you should be able to open a new notebook and import pyspark.
Notice though that installing pyspark that way (i.e. with pip or conda) gives only limited functionality; from the package docs:
The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
Installing pyspark with pip or conda is a relatively recent addition, aimed at the cases described in the docs above. I don't know what limitations you may face (I have never tried it), but if you need the full functionality, you should download the full Spark distribution (of which pyspark is an integral part).
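As a quick sanity check that the conda-installed package is visible to the interpreter backing Jupyter, you can run something like the following from a terminal (the path comes from step 1 of your question; adjust it if your environment lives elsewhere):
# confirm pyspark is importable in the Anaconda environment that Jupyter uses
$HOME/anaconda3/bin/python -c "import pyspark; print(pyspark.__version__)"
If that prints a version, a plain import pyspark in a new notebook should work as well.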
The intention is to achieve something along the lines of
jupyter-notebook --kernel-options="--mem 1024m --cpus 4"
where kernel-options would be forwarded to the pyspark or spark kernels.
We need this in order to run separate Jupyter servers on the same machine - one for the pyspark kernel and one for the Spark (Scala) kernel. This is a requirement because a single Jupyter server does not support pyspark and (Scala) Spark kernels running concurrently.
For Jupyter 4.0 and later, you should be able to start a Spark-enabled notebook like this:
pyspark [options]
where [options] is the list of any flags you pass to pyspark.
For this to work, you need to set the following environment variables in your .profile:
export PYSPARK_DRIVER_PYTHON="/path/to/my/bin/jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON="/path/to/my/bin/python"
Alternatively, if you are using Apache Toree, you could pass them via SPARK_OPTS:
SPARK_OPTS='--master=local[4]' jupyter notebook
More details can be found in the Apache Toree setup documentation.
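For the two-server setup described in the question, a minimal sketch would be to start each server separately (the port numbers are assumptions; pick whatever is free on your machine):
# server 1: PySpark kernel, driven by pyspark itself
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --port=8888' pyspark --master local[2]
# server 2: Scala kernel via Apache Toree, with Spark options passed through SPARK_OPTS
SPARK_OPTS='--master=local[4]' jupyter notebook --port=8889
Each server then carries its own Spark options, which is what the per-server configuration requires.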
I want to use the spark-csv package from https://github.com/databricks/spark-csv from within the Jupyter service running on a Spark HDInsight cluster on Azure.
On a local cluster I know I can do this with:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
However, I don't know where to put this in the Azure Spark configuration. Any clues or hints are appreciated.
You can use the %%configure magic to add any required external package.
It should be as simple as putting the following snippet in your first code cell.
%%configure
{ "packages":["com.databricks:spark-csv_2.10:1.4.0"] }
This specific example is also covered in the documentation. Just make sure you start the Spark session after the %%configure cell.
One option for managing Spark packages in a cluster from a Jupyter notebook is Apache Toree. Toree gives you some extra line magics that allow you to manage Spark packages from within a Jupyter notebook. For example, inside a Jupyter scala notebook, you would install spark-csv with
%AddDeps com.databricks spark-csv_2.11 1.4.0 --transitive
To install Apache Toree on your Spark cluster, SSH into it and run:
sudo pip install --pre toree
sudo jupyter toree install \
--spark_home=$SPARK_HOME \
--interpreters=PySpark,SQL,Scala,SparkR
I know you specifically asked about Jupyter notebooks running PySpark. At this time, Apache Toree is an incubating project. I have run into trouble using the provided line magics with pyspark notebooks specifically. Maybe you will have better luck. I am looking into why this is, but personally, I prefer Scala in Spark. Hope this helps!
You can try executing your two export lines in a script that you invoke in Azure when the HDInsight cluster is created.
Since you are using HDInsight, you can use a "Script Action" when the Spark cluster is provisioned that imports the needed libraries. The script can be a very simple shell script; it is executed automatically on startup and automatically re-executed on new nodes if the cluster is resized.
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster-linux/
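A minimal sketch of such a Script Action, reusing the two export lines from the question (whether the Jupyter service on your cluster image actually sources /etc/profile.d is an assumption, so treat this as a starting point rather than a drop-in solution):
#!/usr/bin/env bash
# hypothetical script action: expose the spark-csv package to PySpark sessions on each node
cat >> /etc/profile.d/spark-csv.sh <<'EOF'
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
EOF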