Run spark-shell inside Docker container against remote YARN cluster - apache-spark

Maybe someone already has experience building a Docker image for Spark?
I want to build a Docker image with Spark inside, but configured against a remote YARN cluster.
I have already created an image with Spark 1.6.2 inside.
But when I run
spark-shell --master yarn --deploy-mode client --driver-memory 32G --executor-memory 32G --executor-cores 8
inside Docker, I get the following exception:
Diagnostics: java.io.FileNotFoundException: File file:/usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.2.0.jar does not exist
Any suggestions?
Do I need to load the spark-assembly into HDFS and set spark.yarn.jar=hdfs://spark-assembly-1.6.2-hadoop2.2.0.jar ?
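If uploading the assembly to HDFS is the way to go, I imagine it would look roughly like this (the HDFS path here is just an example, not something from my setup):
hadoop fs -mkdir -p /user/spark/jars
hadoop fs -put /usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.2.0.jar /user/spark/jars/
spark-shell --master yarn --deploy-mode client \
  --conf spark.yarn.jar=hdfs:///user/spark/jars/spark-assembly-1.6.2-hadoop2.2.0.jar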
Here is my Dockerfile
https://gist.github.com/ponkin/cac0a071e7fe75ca7c390b7388cf4f91

Related

Standard way to store/upload application jar on Spark cluster on Kubernetes

I have a Spark-based Kubernetes cluster where I use spark-submit to submit jobs to the cluster as needed.
e.g.
spark-submit \
--master spark://my-spark-master-svc:7077 \
--class com.Main \
examples/jars/my-spark-application.jar
Here I uploaded the file my-spark-application.jar with kubectl cp into the directory examples/jars on the master Pod/container before running the spark-submit command.
Another option could be mounting a Volume on the cluster and sharing the jar via that volume.
What is the typical way to share the application jar with the spark cluster while using spark-submit on Kubernetes?
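For concreteness, the kubectl cp approach I am using looks roughly like this (the pod name and paths are illustrative):
# copy the fat jar into the master pod before submitting (pod name is illustrative)
kubectl cp my-spark-application.jar my-spark-master-0:/opt/spark/examples/jars/my-spark-application.jar
# then run spark-submit from inside that pod
kubectl exec -it my-spark-master-0 -- \
  /opt/spark/bin/spark-submit \
    --master spark://my-spark-master-svc:7077 \
    --class com.Main \
    /opt/spark/examples/jars/my-spark-application.jar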

Some spark-submit config options not reflected in k8s pod

I'm using spark-submit to create a spark driver pod on my k8s cluster. When I run
bin/spark-submit
--master k8s://https://my-cluster-url:443
--deploy-mode cluster
--name spark-test
--class com.my.main.Class
--conf spark.executor.instances=3
--conf spark.kubernetes.allocation.batch.size=3
--conf spark.kubernetes.namespace=my-namespace
--conf spark.kubernetes.container.image.pullSecrets=my-cr-secret
--conf spark.kubernetes.container.image.pullPolicy=Always
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.my-pvc.mount.path=/var/service/src/main/resources/
--conf spark.kubernetes.container.image=my-registry.io/spark-test:test-3.0.0
local:///var/service/my-service-6.3.0-RELEASE.jar
spark-submit successfully creates a pod in my k8s cluster. However, many of the config options I specified do not take effect. For example, the pod does not have a volume mounted at /var/service/src/main/resources/ despite the existence of a persistentVolumeClaim on the cluster called my-pvc. Further, the pod has not been given the specified image pull secret my-cr-secret, causing an ImagePullBackOff error. On the other hand, the pod is properly created in the my-namespace namespace and with the pull policy Always.
I have attempted this using Spark 3.0.0 and 2.4.5.
Why are some config options not reflected in the pod created on my cluster?
Figured out the issue:
I currently have spark 2.3.1 installed locally and the variable SPARK_HOME points to /usr/local/spark. For this current project I downloaded a distribution of spark 2.4.5. I was in the 2.4.5 directory and running bin/spark-submit, which should have (as far as I can tell) pointed to the spark-submit bundled in 2.4.5. However, running bin/spark-submit --version revealed that the version being run was 2.3.1. The configurations that were being ignored in my question above were not available in 2.3.1.
Simply changing SPARK_HOME to the new directory fixed the issue.
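A quick way to catch this kind of mismatch (the path below is illustrative):
# check which Spark installation the wrapper script actually resolves to
bin/spark-submit --version
echo $SPARK_HOME
# point SPARK_HOME at the newly downloaded distribution
export SPARK_HOME=/path/to/spark-2.4.5-bin-hadoop2.7
$SPARK_HOME/bin/spark-submit --version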

What are the alternatives to run Spark application in cluster mode?

I am writing a Spark application in Scala. My application is packaged into a jar file with Maven, and I can run it in local mode (standalone) with this command:
spark-submit --class classes.mainClass --master local --driver-memory 30G logs-0.0.7-SNAPSHOT-jar-with-dependencies
My question: how can I try my application in cluster mode?
I need to test my application using 1, 2, 3, ... cluster machines.
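For illustration, submitting the same jar to a standalone cluster in cluster mode would look roughly like this (the master host, port and jar path are placeholders, not values from my setup):
spark-submit \
  --class classes.mainClass \
  --master spark://<master-host>:7077 \
  --deploy-mode cluster \
  --driver-memory 30G \
  /path/to/logs-0.0.7-SNAPSHOT-jar-with-dependencies.jar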

Deploy Spark into Kubernetes Cluster

I'm a newbie in the Kubernetes & Spark environment.
I've been asked to deploy Spark inside Kubernetes so that it can auto-scale horizontally.
The problem is that I can't deploy the SparkPi example from the official website (https://spark.apache.org/docs/latest/running-on-kubernetes#cluster-mode).
I've already followed the instructions, but the pods fail to execute.
Here is the explanation:
Already running: kubectl proxy
When I execute:
spark-submit --master k8s://https://localhost:6445 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=xnuxer88/spark-kubernetes-bash-test-entry:v1 local:///opt/spark/examples/jars/spark-examples_2.11-2.3.2.jar
I get the error:
Error: Could not find or load main class org.apache.spark.examples.SparkPi
When I check the Docker image (by creating a container from the related image), I can find the file.
Is there any missing instruction that I forgot to follow?
Please Help.
Thank You.
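One thing I tried in order to rule out a path mismatch (the image tag and jar path are the ones from my command above; the second check assumes unzip is available in the image):
# list the jars bundled in the image to confirm the exact file name and path
docker run --rm --entrypoint ls xnuxer88/spark-kubernetes-bash-test-entry:v1 /opt/spark/examples/jars
# optionally confirm the SparkPi class is really inside the examples jar
docker run --rm --entrypoint /bin/sh xnuxer88/spark-kubernetes-bash-test-entry:v1 \
  -c "unzip -l /opt/spark/examples/jars/spark-examples_2.11-2.3.2.jar | grep SparkPi"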

Submit docker which contains fat jar to Spark cluster

I want to submit a Docker container which contains a 'fat jar' to a Spark cluster running on DC/OS. Here's what I have done:
mvn clean install, so the jar resides here /target/application.jar
docker build -t <repo/image> . && docker push <repo/image>
Now my DC/OS is able to pull the image from my private repository
My Dockerfile looks like this:
# I extended from this image to get all necessary components
FROM docker-release.com/spark:0.1.1-2.1.0-2.8.0
# just put the fat jar under the root dir of the Docker image
ADD target/application.jar /application.jar
COPY bootstrap.sh /etc/bootstrap.sh
ENTRYPOINT ["/etc/bootstrap.sh"]
Here's what bootstrap.sh looks like:
#!/bin/bash -e
/usr/local/spark/bin/spark-submit --class com.spark.sample.MainClass --master spark://<host>:<port> --deploy-mode cluster --executor-memory 20G --total-executor-cores 100 /application.jar
I deployed this image as a service to DC/OS, where the Spark cluster also runs, and the service successfully submits to the Spark cluster. However, the Spark cluster is not able to locate the jar because it sits inside the service's Docker container.
I0621 06:06:25.985144 8760 fetcher.cpp:167] Copying resource with
command:cp '/application.jar'
'/var/lib/mesos/slave/slaves/e8a89a81-1da6-46a2-8caa-40a37a3f7016-S4/frameworks/e8a89a81-1da6-46a2-8caa-40a37a3f7016-0003/executors/driver-20170621060625-18190/runs/c8e710a6-14e3-4da5-902d-e554a0941d27/application.jar'
cp: cannot stat '/application.jar': No such file or directory
Failed to fetch '/application.jar':
Failed to copy with command 'cp '/application.jar'
'/var/lib/mesos/slave/slaves/e8a89a81-1da6-46a2-8caa-40a37a3f7016-S4/frameworks/e8a89a81-1da6-46a2-8caa-40a37a3f7016-0003/executors/driver-20170621060625-18190/runs/c8e710a6-14e3-4da5-902d-e554a0941d27/application.jar'',
exit status: 256 Failed to synchronize with agent (it's probably
exited)
My question is:
Does the jar need to be placed somewhere other than inside the Docker container? It doesn't make any sense to me, but if not, how can Spark correctly find the jar file?
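For context, one workaround I am considering is hosting the jar somewhere the Mesos fetcher can actually reach, instead of baking a container-local path into bootstrap.sh (the URL below is purely illustrative):
/usr/local/spark/bin/spark-submit \
  --class com.spark.sample.MainClass \
  --master spark://<host>:<port> \
  --deploy-mode cluster \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://<artifact-host>/jars/application.jar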
