Apache Toree to connect to a remote spark cluster - apache-spark

Is there a way to connect Apache Toree to a remote spark cluster? I see the common command is
jupyter toree install --spark_home=/usr/local/bin/apache-spark/
How can I go about using spark on a remote server without having to install locally?

There is indeed a way of getting Toree to connect to a remote Spark cluster.
The easiest way I've discovered is to clone the existing Toree Scala/Python kernel, and create a new Toree Scala/Python Remote kernel. That way you can have the choice of running locally or remotely.
Steps:
Make a copy of the existing kernel. On my particular Toree install, the kernels were located at /usr/local/share/jupyter/kernels/, so I ran the following command:
cp -pr /usr/local/share/jupyter/kernels/apache_toree_scala/ /usr/local/share/jupyter/kernels/apache_toree_scala_remote/
Edit the new kernel.json file in /usr/local/share/jupyter/kernels/apache_toree_scala_remote/ and add the requisite Spark options to the __TOREE_SPARK_OPTS__ variable. Technically, only --master <url> is required, but you can also add --num-executors, --executor-memory, etc. to the variable.
Restart Jupyter.
My kernel.json file looks like this:
{
  "display_name": "Toree - Scala Remote",
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_scala_remote/bin/run.sh",
    "--profile",
    "{connection_file}"
  ],
  "language": "scala",
  "env": {
    "PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.9-src.zip",
    "SPARK_HOME": "/opt/spark",
    "DEFAULT_INTERPRETER": "Scala",
    "PYTHON_EXEC": "python",
    "__TOREE_OPTS__": "",
    "__TOREE_SPARK_OPTS__": "--master spark://192.168.0.255:7077 --deploy-mode client --num-executors 4 --executor-memory 4g --executor-cores 8 --packages com.databricks:spark-csv_2.10:1.4.0"
  }
}
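If you would rather script those three steps, here is a minimal sketch (untested, assuming the same kernel paths and Spark options shown above) that copies the kernel directory and patches kernel.json using only Python's standard library:
import json
import shutil
from pathlib import Path

# Assumed paths -- adjust to wherever your Toree kernels actually live.
src = Path("/usr/local/share/jupyter/kernels/apache_toree_scala")
dst = Path("/usr/local/share/jupyter/kernels/apache_toree_scala_remote")

shutil.copytree(src, dst)  # same effect as the cp -pr command above

kernel_json = dst / "kernel.json"
spec = json.loads(kernel_json.read_text())
spec["display_name"] = "Toree - Scala Remote"
spec["argv"][0] = str(dst / "bin" / "run.sh")  # point at the copied run.sh
# Only --master is strictly required; the rest are optional tuning flags.
spec["env"]["__TOREE_SPARK_OPTS__"] = (
    "--master spark://192.168.0.255:7077 --deploy-mode client "
    "--num-executors 4 --executor-memory 4g"
)
kernel_json.write_text(json.dumps(spec, indent=2))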

This is one possible approach that should carry over to ANY remote cluster install; the specific steps below are for my remote cluster, which is Cloudera 5.9.2. (With some judicious edits you can also use this example for non-Cloudera clusters.)
On OS X, to build the CDH version (skip this if you are using a prebuilt distribution):
Go to https://github.com/Myllyenko/incubator-toree and clone the repo
Download Docker
Set up 'signing' - it's been some time since I did this, but you'll need to sign the build above. TBD
Create a new git branch, then edit the .travis.yml, README.md, and build.sbt files to change 5.10.x to 5.9.2
Start Docker, cd into the release directory, build with make release, wait, wait some more, then sign the 3 builds
Copy the file ./dist/toree-pip/toree-0.2.0-spark-1.6.0-cdh5.9.2.tar.gz to your spark-shell machine that can reach your YARN-controlled Spark cluster
Merge and commit your changes back to your master repo if this will be mission critical
Spark Machine Installs:
Warning: Some steps may need to be done as root as a last resort
Install pip / anaconda (see other docs)
Install Jupyter: sudo pip install jupyter
Install Toree: sudo pip install toree-0.2.0-spark-1.6.0-cdh5.9.2, or use the apache-toree distribution
Configure Toree to run with Jupyter (example):
Edit & add to ~/.bash_profile
echo $PATH
PATH=$PATH:$HOME/bin
export PATH
echo $PATH
export CDH_SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/spark
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
PATH=$PATH:$SPARK_HOME/bin
export PATH
echo $PATH
export SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
com.databricks:spark-csv_2.10:1.5.0
END
)
export SPARK_JARS=$(cat << END | xargs echo | sed 's/ /,/g'
/home/mymachine/extras/someapp.jar
/home/mymachine/extras/jsoup-1.10.3.jar
END
)
export TOREE_JAR="/usr/local/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.2.0-spark-1.6.0-cdh5.9.2-incubating.jar"
export SPARK_OPTS="--master yarn-client --conf spark.yarn.config.gatewayPath=/opt/cloudera/parcels --conf spark.scheduler.mode=FAIR --conf spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.yarn.historyServer.address=http://yourCDHcluster.net:18088 --conf spark.default.parallelism=20 --conf spark.driver.maxResultSize=1g --conf spark.driver.memory=1g --conf spark.executor.cores=4 --conf spark.executor.instances=5 --conf spark.executor.memory=1g --packages $SPARK_PKGS --jars $SPARK_JARS"
function jti() {
jupyter toree install \
--replace \
--user \
--kernel_name="CDH 5.9.2 Toree" \
--debug \
--spark_home=${SPARK_HOME} \
--spark_opts="$SPARK_OPTS" \
--log-level=0
}
function jn() {
jupyter notebook --ip=127.0.0.1 --port=8888 --debug --log-level=0
}
If you want to reach Jupyter/Toree on a different port, now is your chance to change 8888
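Before logging out and running jti / jn, it can be worth sanity-checking that the profile actually exported everything. Nothing Toree-specific here, just a small sketch that reads back the environment variables set above:
import os

# Variables the ~/.bash_profile above should have exported.
for var in ("SPARK_HOME", "SPARK_CONF_DIR", "HADOOP_HOME",
            "PYTHONPATH", "SPARK_OPTS", "TOREE_JAR"):
    print("%s=%s" % (var, os.environ.get(var, "<missing>")))

# TOREE_JAR should point at an existing jar once Toree is installed.
print("TOREE_JAR exists: %s" % os.path.isfile(os.environ.get("TOREE_JAR", "")))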
Log out of your Toree / spark-shell machine
ssh back to that machine with ssh -L 8888:localhost:8888 toreebox.cdhcluster.net (assuming 8888 is the port in the bash file)
As a regular user (not root), type jti to install Toree into Jupyter. (Note: understanding this step may help when installing other kernels into Jupyter. Sidebar: #jamcom mentioned the produced file; this step generates it automatically, and because the install is per-user rather than as root, the file is buried in your home directory's tree.)
As that user, type jn to start a Jupyter Notebook. Wait a few seconds until the notebook URL is printed, then paste that URL into your browser.
You now have Jupyter running, so pick the new CDH 5.9.2 Toree kernel (or whichever version you installed); this launches a new browser window. Since you have some Toree experience, run something like sc.getConf.getAll.sortWith(_._1 < _._1).foreach(println) to get the lazily instantiated Spark context going. Be really patient as your job is submitted to the cluster: you may have to wait a long time if your cluster is busy, or just a little while for your job to run.
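If you installed the PySpark flavour of the Toree kernel instead of the Scala one, a roughly equivalent warm-up cell (assuming the kernel predefines sc, as Toree normally does) would be:
# Python equivalent of the Scala one-liner above: dump the Spark conf,
# which forces the lazily created SparkContext to start on the cluster.
for key, value in sorted(sc.getConf().getAll()):
    print("%s=%s" % (key, value))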
Tips and Tricks:
I ran into an issue on the first run that never recurred on subsequent runs. (The issue might already be fixed on GitHub.)
Sometimes, I have to kill the old 'Apache Toree' app on YARN to start a new Toree.
Sometimes my VM ends up with an orphaned JVM. If you get memory errors starting a Jupyter Notebook / Toree, or you were unexpectedly disconnected, check your process list with top and kill the extra JVM (be careful identifying your lost process).

Related

How to get basic Spark program running on Kubernetes

I'm trying to get off the ground with Spark and Kubernetes but I'm facing difficulties. I used the helm chart here:
https://github.com/bitnami/charts/tree/main/bitnami/spark
I have 3 workers and they all report running successfully. I'm trying to run the following program remotely:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://<master-ip>:<master-port>").getOrCreate()
df = spark.read.json('people.json')
Here's the part that's not entirely clear. Where should the file people.json actually live? I have it locally where I'm running the python code and I also have it on a PVC that the master and all workers can see at /sparkdata/people.json.
When I run the 3rd line as simply 'people.json' then it starts running but errors out with:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
If I run it as '/sparkdata/people.json' then I get
pyspark.sql.utils.AnalysisException: Path does not exist: file:/sparkdata/people.json
Not sure where I go from here. To be clear I want it to read files from the PVC. It's an NFS share that has the data files on it.
Your people.json file needs to be accessible to your driver + executor pods. This can be achieved in multiple ways:
having some kind of network/cloud drive that each pod can access
mounting volumes on your pods, and then uploading the data to those volumes using --files in your spark-submit.
The latter option might be the simpler one to set up. This page discusses how to do that in more detail, but in short: if you add the following arguments to your spark-submit, you should be able to get your people.json onto your driver + executors (you just have to choose sensible values for the $VAR variables):
--files people.json \
--conf spark.kubernetes.file.upload.path=$SOURCE_DIR \
--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
--conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
--conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
--conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
You can always verify the existence of your data by going inside of the pods themselves like so:
kubectl exec -it <driver/executor pod name> bash
(now you should be inside of a bash process in the pod)
cd <mount-path-you-chose>
ls -al
That last ls -al command should show you a people.json file in there (after having done your spark-submit of course).
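Once the file is visible inside the pods, the read in your original snippet should use the in-pod mount path rather than a path on your laptop. A rough sketch using the same placeholder style as above (substitute your own master URL and the mount path you chose):
from pyspark.sql import SparkSession

# Placeholder values -- substitute your master URL and the mount path you chose.
spark = (SparkSession.builder
         .master("spark://<master-ip>:<master-port>")
         .getOrCreate())

# Read from the path as the driver/executor pods see it (the volume mount),
# not from a path on the machine where you launched the job.
df = spark.read.json("<mount-path-you-chose>/people.json")
df.show()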
Hope this helps!

AWS EMR - ModuleNotFoundError: No module named 'arrow'

I'm running into this issue when trying to upgrade to Python 3.9 for our EMR jobs using PySpark 3.0.1 / EMR release 6.2.1. I've created the EMR cluster using a bootstrap script, and here are the Spark environment variables that were set:
export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip
I've installed all the application dependency libs using a shell script; they are located in /home/ec2-user. However, when I try to spark-submit a job with the following command as user hadoop, I'm seeing the ModuleNotFoundError.
Spark-submit cmd:
/bin/sh -c "MYAPP_ENV=dev PYSPARK_PYTHON=/usr/local/bin/python3 PYTHONHASHSEED=0 SETUPTOOLS_USE_DISTUTILS=stdlib spark-submit --master yarn --deploy-mode client --jars /home/hadoop/ext_lib/*.jar --py-files /home/hadoop/myapp.zip --conf spark.sql.parquet.compression.codec=gzip --conf spark.executorEnv.MYAPP_ENV=dev /home/hadoop/myapp/oasis/etl/spark/daily/run_daily_etl.py '--lookback_days' '1' '--s3_file_system' 's3'"
Error: ModuleNotFoundError: No module named 'arrow'
However, the same works when we use an EMR cluster with release label emr-5.28.0 and Spark 2.4.4.
Can someone help identify the cause? I'm fully stuck with this. I suspect it may be due to the hadoop user not being able to access the ec2-user home folder.
Thanks

spark docker-image-tool Cannot find docker image

I deployed Spark on Kubernetes:
helm install microsoft/spark --version 1.0.0 (also tried bitnami chart with the same result)
then, as described in https://spark.apache.org/docs/latest/running-on-kubernetes.html#submitting-applications-to-kubernetes, I go to $SPARK_HOME/bin and run
docker-image-tool.sh -r -t my-tag build
this returns
Cannot find docker image. This script must be run from a runnable distribution of Apache Spark.
but all the Spark runnables are in this directory:
bash-4.4# cd $SPARK_HOME/bin
bash-4.4# ls
beeline find-spark-home.cmd pyspark.cmd spark-class spark-shell.cmd spark-sql2.cmd sparkR
beeline.cmd load-spark-env.cmd pyspark2.cmd spark-class.cmd spark-shell2.cmd spark-submit sparkR.cmd
docker-image-tool.sh load-spark-env.sh run-example spark-class2.cmd spark-sql spark-submit.cmd sparkR2.cmd
find-spark-home pyspark run-example.cmd spark-shell spark-sql.cmd spark-submit2.cmd
Any suggestions on what I am doing wrong?
I haven't made any other Spark configurations; am I missing something? Should I install Docker myself, or any other tools?
You are mixing things up here.
When you run helm install microsoft/spark --version 1.0.0 you're deploying Spark with all its prerequisites inside Kubernetes. Helm does all the hard work for you; after you run this, Spark is ready to use.
Then, after deploying Spark with Helm, you try to deploy Spark again from inside a Spark pod that is already running on Kubernetes.
These are two different things that are not meant to be mixed. That guide explains how to run Spark on Kubernetes by hand, but fortunately it can also be done with Helm, as you did before.
When you run helm install myspark microsoft/spark --version 1.0.0, the output is telling you how to access your spark webui:
NAME: myspark
LAST DEPLOYED: Wed Apr 8 08:01:39 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
1. Get the Spark URL to visit by running these commands in the same shell:
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
You can watch the status of by running 'kubectl get svc --namespace default -w myspark-webui'
export SPARK_SERVICE_IP=$(kubectl get svc --namespace default myspark-webui -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo http://$SPARK_SERVICE_IP:8080
2. Get the Zeppelin URL to visit by running these commands in the same shell:
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
You can watch the status of by running 'kubectl get svc --namespace default -w myspark-zeppelin'
export ZEPPELIN_SERVICE_IP=$(kubectl get svc --namespace default myspark-zeppelin -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo http://$ZEPPELIN_SERVICE_IP:8080
Let's check it:
$ export SPARK_SERVICE_IP=$(kubectl get svc --namespace default myspark-webui -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
$ echo http://$SPARK_SERVICE_IP:8080
http://34.70.212.182:8080
If you open this URL you have your Spark webui ready.
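If you prefer to check from a script rather than a browser, a small optional probe against that same URL (the IP below is just the one echoed above) could look like:
import urllib.request

# Substitute the address echoed above (http://$SPARK_SERVICE_IP:8080).
url = "http://34.70.212.182:8080"
with urllib.request.urlopen(url, timeout=10) as resp:
    # The standalone master web UI answers with an HTML status page.
    print(resp.status, resp.reason)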

Submit an application to a standalone spark cluster running in GCP from Python notebook

I am trying to submit a Spark application to a standalone Spark (2.1.1) cluster of 3 VMs running in GCP from my Python 3 notebook (running on my local laptop), but for some reason the Spark session throws the error "StandaloneAppClient$ClientEndpoint: Failed to connect to master sparkmaster:7077".
Environment details: IPython and the Spark master run on one GCP VM called "sparkmaster". 3 additional GCP VMs run the Spark workers and Cassandra clusters. I connect from my local laptop (MBP) using Chrome to the IPython notebook on "sparkmaster".
Please note that this works from the terminal:
bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 --master spark://sparkmaster:7077 ex.py 1000
Running it from Python Notebook:
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 pyspark-shell'
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark=SparkSession.builder.master("spark://sparkmaster:7077").appName('somatic').getOrCreate() # This step works if I make it .master('local')
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092") \
.option("subscribe", "gene") \
.load()
so far I have tried these:
I have tried changing the Spark master node's spark-defaults.conf and spark-env.sh to add SPARK_MASTER_IP.
Tried to find the STANDALONE_SPARK_MASTER_HOST=`hostname -f` setting so that I could remove the "-f". For some reason my Spark master UI shows FQDN:7077, not hostname:7077.
Passed the FQDN as a param to .master() and os.environ["PYSPARK_SUBMIT_ARGS"].
Please let me know if you need more details.
After doing some more research I was able to resolve the conflict. It was due to a simple environment variable called SPARK_HOME. In my case it was pointing to Conda's /bin (pyspark was running from that location), whereas my Spark setup was in a different path. The simple fix was to add
export SPARK_HOME="/home/<<your location path>>/spark/" to the .bashrc file (I want this attached to my profile, not to the Spark session).
How I have done it:
Step 1: ssh to the master node (in my case it was the same as the IPython kernel/server VM in GCP)
Step 2:
cd ~
sudo nano .bashrc
scroll down to the last line and paste the below line
export SPARK_HOME="/home/your/path/to/spark-2.1.1-bin-hadoop2.7/"
Ctrl+X, then Y, then Enter to save the changes
Note: I have also added a few more details to the environment section above for clarity.
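To confirm which Spark installation your notebook is actually picking up, before and after the .bashrc change, here is a quick sketch you can run in a notebook cell:
import os
import pyspark

# If SPARK_HOME still points at Conda's install rather than the
# spark-2.1.1-bin-hadoop2.7 directory, the mismatch shows up here.
print("SPARK_HOME = %s" % os.environ.get("SPARK_HOME"))
print("pyspark loaded from: %s" % pyspark.__file__)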

Running python in yarn/spark cluster mode using virtualenv

My python app on yarn/spark does not recognize the requirements.txt file to create a virtualenv on the worker nodes, and continues to use the global environment. Any help to fix this would be much appreciated.
Spark version: 2.0.1
submit script after running pip freeze > requirements-test.txt from within the virtual environment that I want to recreate at the nodes:
/usr/bin/spark-submit --master yarn --deploy-mode client --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/mnt/py_env/requirements-test.txt --conf spark.pyspark.virtualenv.bin.path=/mnt/anaconda2/bin/virtualenv --conf spark.pyspark.python=/mnt/py_env/test/bin/python /home/hadoop/python/spark_virtualenv.py
My requirements-test.txt file:
dill==0.2.7.1
Lifetimes==0.8.0.0
numpy==1.13.1
pandas==0.20.3
python-dateutil==2.6.1
pytz==2017.2
scipy==0.19.1
six==1.10.0
My /home/hadoop/python/spark_virtualenv.py:
from pyspark import SparkContext
#import lifetimes

if __name__ == "__main__":
    sc = SparkContext(appName="Simple App")
    import numpy as np
    sc.parallelize(range(1, 10)).map(lambda x: np.__version__).collect()
    print "//////////// works! //////////"
    #print lifetimes.__version__
    print np.__file__
From the output, I see that it is still importing only my global numpy package and not the one in the virtual environment:
//////////// works! //////////
/mnt/anaconda2/lib/python2.7/site-packages/numpy/__init__.pyc
PS: I have anaconda2 installed on all nodes of my cluster
One other point: If my spark-submit option is changed to --deploy-mode cluster then the output is different:
//////////// works! //////////
/usr/local/lib64/python2.7/site-packages/numpy/__init__.pyc
Anaconda might have a preferred way of doing this through conda, but one thought is to ship the files from the lifetimes package (utils.py, estimation.py, etc.) with lines like:
sc.addPyFile("/fully/articulated/path/file.py")
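For example (hypothetical paths below), files shipped with addPyFile land on the executors' sys.path, so they become importable by module name inside your map functions:
from pyspark import SparkContext

sc = SparkContext(appName="Simple App")

# Hypothetical paths -- point these at the package's modules inside your env.
for path in ["/mnt/py_env/test/lib/python2.7/site-packages/lifetimes/utils.py",
             "/mnt/py_env/test/lib/python2.7/site-packages/lifetimes/estimation.py"]:
    sc.addPyFile(path)

def which_file(_):
    import utils  # files shipped via addPyFile are importable by name on executors
    return utils.__file__

print(sc.parallelize(range(3)).map(which_file).collect())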
