spark docker-image-tool Cannot find docker image - apache-spark

I deployed Spark on Kubernetes:
helm install microsoft/spark --version 1.0.0 (also tried bitnami chart with the same result)
Then, as described in https://spark.apache.org/docs/latest/running-on-kubernetes.html#submitting-applications-to-kubernetes
I went to $SPARK_HOME/bin and ran:
docker-image-tool.sh -r -t my-tag build
This returns:
Cannot find docker image. This script must be run from a runnable distribution of Apache Spark.
But all the Spark executables are in this directory:
bash-4.4# cd $SPARK_HOME/bin
bash-4.4# ls
beeline find-spark-home.cmd pyspark.cmd spark-class spark-shell.cmd spark-sql2.cmd sparkR
beeline.cmd load-spark-env.cmd pyspark2.cmd spark-class.cmd spark-shell2.cmd spark-submit sparkR.cmd
docker-image-tool.sh load-spark-env.sh run-example spark-class2.cmd spark-sql spark-submit.cmd sparkR2.cmd
find-spark-home pyspark run-example.cmd spark-shell spark-sql.cmd spark-submit2.cmd
Any suggestions as to what I am doing wrong?
I haven't configured anything else in Spark. Am I missing something? Should I install Docker myself, or any other tools?

You are mixing two things here.
When you run helm install microsoft/spark --version 1.0.0, you deploy Spark with all its prerequisites inside Kubernetes. Helm does all the hard work for you, and once the install finishes Spark is ready to use.
Then, after deploying Spark with Helm, you are trying to deploy Spark again from inside a Spark pod that is already running on Kubernetes.
These are two different approaches that are not meant to be mixed. The guide you linked explains how to run Spark on Kubernetes by hand, but fortunately the same result can be achieved with Helm, as you already did.
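For context on the error message itself, here is a hedged sketch. As far as I can tell from the Spark 2.4 version of docker-image-tool.sh, the message is printed when the script cannot find a kubernetes/dockerfiles directory, which an unpacked Spark distribution contains but the image used by a chart may not. The helper name has_k8s_dockerfiles is hypothetical, and the directory layout is an assumption that may differ between Spark versions.

```shell
# Hypothetical helper, not part of Spark: reports whether a directory looks like
# a Spark distribution that docker-image-tool.sh could build images from.
# Assumption: the tool validates that kubernetes/dockerfiles exists (Spark 2.4 layout).
has_k8s_dockerfiles() {
  local spark_dir="$1"
  if [ -d "$spark_dir/kubernetes/dockerfiles" ]; then
    echo "found $spark_dir/kubernetes/dockerfiles"
    return 0
  fi
  echo "missing $spark_dir/kubernetes/dockerfiles" >&2
  return 1
}

# Usage (on a machine with Docker and an unpacked Spark distribution):
#   has_k8s_dockerfiles "$SPARK_HOME"
```

If the directory is missing inside the pod, that is consistent with the point above: the pod is a running Spark deployment, not a place to build Spark images from.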
When you run helm install myspark microsoft/spark --version 1.0.0, the output tells you how to access your Spark web UI:
NAME: myspark
LAST DEPLOYED: Wed Apr 8 08:01:39 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
1. Get the Spark URL to visit by running these commands in the same shell:
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
You can watch the status of by running 'kubectl get svc --namespace default -w myspark-webui'
export SPARK_SERVICE_IP=$(kubectl get svc --namespace default myspark-webui -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo http://$SPARK_SERVICE_IP:8080
2. Get the Zeppelin URL to visit by running these commands in the same shell:
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
You can watch the status of by running 'kubectl get svc --namespace default -w myspark-zeppelin'
export ZEPPELIN_SERVICE_IP=$(kubectl get svc --namespace default myspark-zeppelin -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo http://$ZEPPELIN_SERVICE_IP:8080
Let's check it:
$ export SPARK_SERVICE_IP=$(kubectl get svc --namespace default myspark-webui -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
$ echo http://$SPARK_SERVICE_IP:8080
http://34.70.212.182:8080
If you open this URL, your Spark web UI is ready to use.

Related

Helm - Spark operator examples/spark-pi.yaml does not exist

I've deployed Spark Operator to GKE using the Helm Chart to a custom namespace:
helm install --name sparkoperator incubator/sparkoperator --namespace custom-ns --set sparkJobNamespace=custom-ns
and confirmed that the operator is running in the cluster with helm status sparkoperator.
However, when I try to run the Spark Pi example with kubectl apply -f examples/spark-pi.yaml, I get the following error:
the path "examples/spark-pi.yaml" does not exist
There are a few things that I probably still don't get:
Where is examples/spark-pi.yaml actually located after deploying the operator?
What else should I check and what other steps should I take to make the example work?
Please find the spark-pi.yaml file here.
Copy it to your filesystem, customize it if needed, and pass a valid path to it with kubectl apply -f path/to/spark-pi.yaml.
kubectl apply needs a YAML file that either exists locally on the machine where you run kubectl or is served from an http/https URL. Deploying the operator with Helm does not place the examples on your machine.
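A minimal sketch of a wrapper that checks for the file before calling kubectl, so a missing local manifest fails with a clearer message. The function name apply_manifest is hypothetical; only kubectl apply -f comes from the answer above.

```shell
# Hypothetical wrapper: fail early when a local manifest is missing.
apply_manifest() {
  local src="$1"
  case "$src" in
    http://*|https://*)
      ;;  # kubectl downloads URLs itself, so there is nothing to check locally
    *)
      if [ ! -f "$src" ]; then
        echo "manifest not found on this machine: $src" >&2
        return 1
      fi
      ;;
  esac
  kubectl apply -f "$src"
}

# Usage:
#   apply_manifest path/to/spark-pi.yaml
```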

Apache Toree to connect to a remote spark cluster

Is there a way to connect Apache Toree to a remote spark cluster? I see the common command is
jupyter toree install --spark_home=/usr/local/bin/apache-spark/
How can I go about using spark on a remote server without having to install locally?
There is indeed a way of getting Toree to connect to a remote Spark cluster.
The easiest way I've discovered is to clone the existing Toree Scala/Python kernel, and create a new Toree Scala/Python Remote kernel. That way you can have the choice of running locally or remotely.
Steps:
Make a copy of the existing kernel. On my particular Toree install, the path to the Kernels was located at: /usr/local/share/jupyter/kernels/, so I performed the following command:
cp -pr /usr/local/share/jupyter/kernels/apache_toree_scala/ /usr/local/share/jupyter/kernels/apache_toree_scala_remote/
Edit the new kernel.json file in /usr/local/share/jupyter/kernels/apache_toree_scala_remote/ and add the requisite Spark options to the __TOREE_SPARK_OPTS__ variable. Technically, only --master <path> is required, but you can also add --num-executors, --executor-memory, etc to the variable as well.
Restart Jupyter.
My kernel.json file looks like this:
{
"display_name": "Toree - Scala Remote",
"argv": [
"/usr/local/share/jupyter/kernels/apache_toree_scala_remote/bin/run.sh",
"--profile",
"{connection_file}"
],
"language": "scala",
"env": {
"PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.9-src.zip",
"SPARK_HOME": "/opt/spark",
"DEFAULT_INTERPRETER": "Scala",
"PYTHON_EXEC": "python",
"__TOREE_OPTS__": "",
"__TOREE_SPARK_OPTS__": "--master spark://192.168.0.255:7077 --deploy-mode client --num-executors 4 --executor-memory 4g --executor-cores 8 --packages com.databricks:spark-csv_2.10:1.4.0"
}
}
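The copy-and-edit steps above can be sketched as one shell function. This is only an illustration: the function name clone_toree_kernel is hypothetical, and the sed expressions assume that kernel.json keeps each key and its value on a single line and that the new values contain none of the characters |, & or \.

```shell
# Hypothetical helper: copy a Toree kernel directory and rewrite two values in the copy.
# Assumptions: each key/value pair in kernel.json is on one line, and neither new
# value contains '|', '&' or '\' (they are special in the sed expressions below).
clone_toree_kernel() {
  local src="$1" dst="$2" display_name="$3" spark_opts="$4"
  if [ -e "$dst" ]; then
    echo "target already exists: $dst" >&2
    return 1
  fi
  cp -pr "$src" "$dst" || return 1
  # Read the original kernel.json and write the edited version into the copy.
  sed -e "s|\"display_name\": *\"[^\"]*\"|\"display_name\": \"$display_name\"|" \
      -e "s|\"__TOREE_SPARK_OPTS__\": *\"[^\"]*\"|\"__TOREE_SPARK_OPTS__\": \"$spark_opts\"|" \
      "$src/kernel.json" > "$dst/kernel.json"
}

# Usage (paths and master URL taken from the example above):
#   clone_toree_kernel /usr/local/share/jupyter/kernels/apache_toree_scala \
#     /usr/local/share/jupyter/kernels/apache_toree_scala_remote \
#     "Toree - Scala Remote" "--master spark://192.168.0.255:7077"
```

The argv entry in the copy still points at the original kernel's run.sh, which works as long as the original kernel stays installed. Edit it by hand if the copy should be independent.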
This is one possible example, with some practical details, that should apply to any remote cluster install. The steps below are specific to my remote cluster, which runs Cloudera 5.9.2. (With some careful edits you can also use this example for non-Cloudera clusters.)
On OS X, to build the CDH version (skip this if you are using a prebuilt distribution):
Go to https://github.com/Myllyenko/incubator-toree and clone the repo
Download Docker
Set up 'signing'. It's been some time since I set this up, but you'll need to sign the build above. TBD
Create a new git branch, and edit the .travis.yml, README.md, and build.sbt files to change 5.10.x to 5.9.2
Start Docker, cd into the directory where you run make release, build with make release, wait, wait, and sign the 3 builds
Copy the file ./dist/toree-pip/toree-0.2.0-spark-1.6.0-cdh5.9.2.tar.gz to your spark-shell machine that can reach your YARN-controlled Spark cluster
Merge, commit, etc. your changes to your master repo if this will be mission critical
Spark Machine Installs:
Warning: Some steps may need to be done as root as a last resort
Install pip / anaconda (see other docs)
Install Jupyter: sudo pip install jupyter
Install Toree: sudo pip install toree-0.2.0-spark-1.6.0-cdh5.9.2 or use the apache-toree distribution
Configure Toree to run with Jupyter (example):
Edit & add to ~/.bash_profile
echo $PATH
PATH=$PATH:$HOME/bin
export PATH
echo $PATH
export CDH_SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/spark
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
PATH=$PATH:$SPARK_HOME/bin
export PATH
echo $PATH
export SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
com.databricks:spark-csv_2.10:1.5.0
END
)
export SPARK_JARS=$(cat << END | xargs echo | sed 's/ /,/g'
/home/mymachine/extras/someapp.jar
/home/mymachine/extras/jsoup-1.10.3.jar
END
)
export TOREE_JAR="/usr/local/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.2.0-spark-1.6.0-cdh5.9.2-incubating.jar"
export SPARK_OPTS="--master yarn-client --conf spark.yarn.config.gatewayPath=/opt/cloudera/parcels --conf spark.scheduler.mode=FAIR --conf spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.yarn.historyServer.address=http://yourCDHcluster.net:18088 --conf spark.default.parallelism=20 --conf spark.driver.maxResultSize=1g --conf spark.driver.memory=1g --conf spark.executor.cores=4 --conf spark.executor.instances=5 --conf spark.executor.memory=1g --packages $SPARK_PKGS --jars $SPARK_JARS"
function jti() {
jupyter toree install \
--replace \
--user \
--kernel_name="CDH 5.9.2 Toree" \
--debug \
--spark_home=${SPARK_HOME} \
--spark_opts="$SPARK_OPTS" \
--log-level=0
}
function jn() {
jupyter notebook --ip=127.0.0.1 --port=8888 --debug --log-level=0
}
If you want a different port for reaching Toree, now is your chance to edit 8888
Log out of your Toree / spark-shell machine
ssh back to that machine with ssh -L 8888:localhost:8888 toreebox.cdhcluster.net (assuming that 8888 is the port in the bash file)
As a regular user (not root), you should be able to type jti to install Toree into Jupyter. (Note: understanding this step may help when installing other kernels into Jupyter. Sidebar: #jamcom mentioned the produced file, but this step produces it automatically. The file is buried in your home directory's tree as a user rather than root.)
As that user, type jn to start a Jupyter Notebook. Wait a few seconds until the browser URL is available, then paste that URL into your browser.
Jupyter is now running, so pick the new CDH 5.9.2 Toree kernel, or whichever version you installed. This opens a new browser window. Since you have some Toree experience, run something like sc.getConf.getAll.sortWith(_._1 < _._1).foreach(println) to get the lazily instantiated Spark context going. Be really patient while your job is submitted to the cluster: you may have to wait a long time if the cluster is busy, or a little while for the job to process.
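The SPARK_PKGS and SPARK_JARS assignments in the profile above use a here-document so that each entry sits on its own line, then collapse the lines into the comma-separated list that --packages and --jars expect. A small sketch of that idiom in isolation; the function name join_lines_with_commas and the second package coordinate are illustrative only:

```shell
# Join the lines on stdin into one comma-separated string.
# Same pipeline as in the profile: xargs puts all words on one line,
# then sed swaps the separating spaces for commas.
join_lines_with_commas() {
  xargs echo | sed 's/ /,/g'
}

PKGS=$(cat << END | join_lines_with_commas
com.databricks:spark-csv_2.10:1.5.0
org.jsoup:jsoup:1.10.3
END
)
echo "$PKGS"
# prints: com.databricks:spark-csv_2.10:1.5.0,org.jsoup:jsoup:1.10.3
```

Because every space becomes a comma, this only works for entries without spaces, which holds for Maven coordinates and for the jar paths shown above.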
Tips and Tricks:
I ran into an issue on the first run, and subsequent runs never saw it. (The issue might be fixed on GitHub by now.)
Sometimes I have to kill the old 'Apache Toree' app on YARN to start a new Toree.
Sometimes my VM has an orphaned JVM. If you get memory errors when starting a Jupyter Notebook / Toree, or you have unexpectedly disconnected, check your process list with top and kill the extra JVM (be careful identifying your lost process).

Livy Server on Amazon EMR hangs on Connecting to ResourceManager

I'm trying to deploy a Livy Server on Amazon EMR. First I built the Livy master branch
mvn clean package -Pscala-2.11 -Pspark-2.0
Then, I uploaded it to the EMR cluster master. I set the following configurations:
livy-env.sh
SPARK_HOME=/usr/lib/spark
HADOOP_CONF_DIR=/etc/hadoop/conf
livy.conf
livy.spark.master = yarn
livy.spark.deployMode = cluster
When I start Livy, it hangs indefinitely while connecting to the YARN ResourceManager (XX.XX.XXX.XX is the IP address):
16/10/28 17:56:23 INFO RMProxy: Connecting to ResourceManager at /XX.XX.XXX.XX:8032
However, when I netcat port 8032, it connects successfully:
nc -zv XX.XX.XXX.XX 8032
Connection to XX.XX.XXX.XX 8032 port [tcp/pro-ed] succeeded!
I think I'm probably missing a step. Does anyone have any idea what it might be?
I made the following changes to the config files after unzipping the livy-server-0.2.0.zip file
livy-env.sh
export SPARK_HOME=/usr/hdp/current/spark-client
export HADOOP_HOME=/usr/hdp/current/hadoop-client/bin/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_CONF_DIR=$SPARK_HOME/conf
export LIVY_LOG_DIR=/jobserver-livy/logs
export LIVY_PID_DIR=/jobserver-livy
export LIVY_MAX_LOG_FILES=10
export HBASE_HOME=/usr/hdp/current/hbase-client/bin
livy.conf
livy.rsc.rpc.server.address=<Loop Back address>
Add 'spark.master yarn-cluster' to the spark-defaults.conf file in the Spark conf folder.
Please let me know if you still have issues.
To get more detail, you can use the following in your log4j.properties. Please post the resulting log file.
log4j.rootCategory=DEBUG, NotConsole
log4j.appender.NotConsole=org.apache.log4j.RollingFileAppender
log4j.appender.NotConsole.File=/<LIVY SERVER INSTALL PATH>/logs/livy.log
log4j.appender.NotConsole.maxFileSize=20MB
log4j.appender.NotConsole.layout=org.apache.log4j.PatternLayout
log4j.appender.NotConsole.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
Looking at the GitHub repo, it seems the master branch is under active development and there is a separate release branch for version 0.2. The straightforward way to install Livy (which worked for me) is to follow the steps on the quickstart page: http://livy.io/quickstart.html
Download the Livy Server distribution:
wget http://archive.cloudera.com/beta/livy/livy-server-0.2.0.zip
Unzip it:
unzip livy-server-0.2.0.zip
Start the server:
$ cd livy-server-0.2.0
$ ./bin/livy-server
16/11/07 20:32:51 INFO LivyServer: Using spark-submit version 2.0.0
16/11/07 20:32:51 WARN RequestLogHandler: !RequestLog
16/11/07 20:32:51 INFO WebServer: Starting server on http://ip-xx-xx-xx-xxx.us-west-2.compute.internal:8998

How to enable Spark mesos docker executor?

I'm working on integration between Mesos and Spark. For now, I can start SlaveMesosDispatcher in Docker, and I'd also like to run the Spark executor in a Mesos Docker container. I used the following configuration, but I get an error. Any suggestions?
Configuration:
Spark: conf/spark-defaults.conf
spark.mesos.executor.docker.image ubuntu
spark.mesos.executor.docker.volumes /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark
spark.mesos.executor.home /root/spark
#spark.executorEnv.SPARK_HOME /root/spark
spark.executorEnv.MESOS_NATIVE_LIBRARY /usr/local/lib
NOTE: Spark is installed in /home/test/workshop/spark, and all dependencies are installed.
After submitting SparkPi to the dispatcher, the driver job starts but then fails. The error message is:
I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0
I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave b7e24114-7585-40bc-879b-6a1188cb65b6-S1
WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
/bin/sh: 1: ./bin/spark-submit: not found
Does anyone know how to map/set the Spark home in Docker for this case?
I think the issue you're seeing is that the container's current working directory isn't where Spark is installed. When you specify a Docker image for Spark to use with Mesos, it expects the default working directory of the container to be inside $SPARK_HOME, where it can find ./bin/spark-submit.
You can see that logic here.
It doesn't look like you're able to configure the working directory through Spark configuration itself, which means you'll need to build a custom image on top of ubuntu that simply does a WORKDIR /root/spark.
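A minimal sketch of such an image, assuming Spark is still mounted at /root/spark through spark.mesos.executor.docker.volumes as in the question. The function name, the build directory, and the image tag my-spark-executor in the usage comment are placeholders:

```shell
# Hypothetical helper: write a two-line Dockerfile that only changes the working directory.
write_executor_dockerfile() {
  local build_dir="$1"
  mkdir -p "$build_dir" || return 1
  cat > "$build_dir/Dockerfile" << 'EOF'
FROM ubuntu
WORKDIR /root/spark
EOF
}

# Usage (requires Docker on the build machine):
#   write_executor_dockerfile ./executor-image
#   docker build -t my-spark-executor ./executor-image
# Then set spark.mesos.executor.docker.image to that tag in conf/spark-defaults.conf.
```

The resulting image has to be pullable or already present on every Mesos agent that may run an executor.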

Spark is not started automatically on the AWS cluster - how to launch it?

A Spark cluster has been launched using the ec2/spark-ec2 script from the branch-1.4 codebase.
I can log in to it, and it reports 1 master and 2 slaves:
11:35:10/sparkup2 $ec2/spark-ec2 -i ~/.ssh/hwspark14.pem login hwspark14
Searching for existing cluster hwspark14 in region us-east-1...
Found 1 master, 2 slaves.
Logging into master ec2-54-83-81-165.compute-1.amazonaws.com...
Warning: Permanently added 'ec2-54-83-81-165.compute-1.amazonaws.com,54.83.81.165' (RSA) to the list of known hosts.
Last login: Tue Jun 23 20:44:05 2015 from c-73-222-32-165.hsd1.ca.comcast.net
       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
Amazon Linux version 2015.03 is available.
But where are they? The only Java processes running are:
Hadoop: NameNode and SecondaryNode
Tachyon: Master and Worker
It surprises me that the Spark Master and Workers were not started. When looking for the scripts to start them manually, it is not at all obvious where they are located.
Hints on
why Spark did not start automatically
and
where the launch scripts live
would be appreciated. (In the meantime I will do an exhaustive search:)
find / -name start-all.sh
And the survey says:
[root@ip-10-151-25-94 etc]$ find / -name start-all.sh
/root/persistent-hdfs/bin/start-all.sh
/root/ephemeral-hdfs/bin/start-all.sh
To me this means that Spark was not even installed.
Update: I wonder whether this is a bug in 1.4.0. I ran the same set of commands with 1.3.1 and the Spark cluster came up.
There was a bug in the Spark 1.4.0 provisioning script, which spark-ec2 clones from a GitHub repository (https://github.com/mesos/spark-ec2/), with similar symptoms: Apache Spark did not start. The reason was that the provisioning script failed to download the Spark archive.
Check whether Spark was downloaded and uncompressed on the master host with ls -altr /root/spark. There should be several directories there. From your description, it looks like the /root/spark/sbin/start-all.sh script, which should be there, is missing.
Also check the contents of the log file with cat /tmp/spark-ec2_spark.log. It should have information about the uncompressing step.
Another thing to try is to run spark-ec2 with a different provisioning script branch by adding --spark-ec2-git-branch branch-1.4 to the spark-ec2 command line arguments.
Also, when you run spark-ec2, save all the output and check whether anything looks suspicious:
spark-ec2 <...args...> 2>&1 | tee start.log
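To make that check less manual, one option is to filter the saved log for common failure words. A rough sketch; the function name scan_provision_log and the keyword list are my own guesses rather than anything spark-ec2 defines:

```shell
# Hypothetical helper: print log lines (with line numbers) that hint at a failed step.
# Returns non-zero when nothing matches, like grep itself.
scan_provision_log() {
  local log_file="$1"
  grep -inE 'error|fail|not found|no such file|cannot' "$log_file"
}

# Usage, after: spark-ec2 <...args...> 2>&1 | tee start.log
#   scan_provision_log start.log
```

An empty result does not prove the run was clean, so it is still worth reading the part of the log around the Spark download and unpack step.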
