Submit docker which contains fat jar to Spark cluster - apache-spark

I want to submit a Docker container which contains a 'fat jar' to a Spark cluster running on DC/OS. Here's what I have done:
1. mvn clean install, so the jar resides at target/application.jar
2. docker build -t <repo/image> . && docker push <repo/image>
3. Now my DC/OS is able to pull the image from my private repository.
My Dockerfile looks like this:
# I extended from this image to get all necessary components
FROM docker-release.com/spark:0.1.1-2.1.0-2.8.0
# Just put the fat jar under the root dir of the Docker image
ADD target/application.jar /application.jar
COPY bootstrap.sh /etc/bootstrap.sh
ENTRYPOINT ["/etc/bootstrap.sh"]
Here's what bootstrap.sh looks like:
#!/bin/bash -e
/usr/local/spark/bin/spark-submit \
  --class com.spark.sample.MainClass \
  --master spark://<host>:<port> \
  --deploy-mode cluster \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /application.jar
I deployed this image as a service to DC/OS, where the Spark cluster also runs, and the service successfully submits to the Spark cluster. However, the Spark cluster is not able to locate the jar, because it only exists inside the service's Docker container:
I0621 06:06:25.985144 8760 fetcher.cpp:167] Copying resource with
command:cp '/application.jar'
'/var/lib/mesos/slave/slaves/e8a89a81-1da6-46a2-8caa-40a37a3f7016-S4/frameworks/e8a89a81-1da6-46a2-8caa-40a37a3f7016-0003/executors/driver-20170621060625-18190/runs/c8e710a6-14e3-4da5-902d-e554a0941d27/application.jar'
cp: cannot stat '/application.jar': No such file or directory
Failed to fetch '/application.jar':
Failed to copy with command 'cp '/application.jar'
'/var/lib/mesos/slave/slaves/e8a89a81-1da6-46a2-8caa-40a37a3f7016-S4/frameworks/e8a89a81-1da6-46a2-8caa-40a37a3f7016-0003/executors/driver-20170621060625-18190/runs/c8e710a6-14e3-4da5-902d-e554a0941d27/application.jar'',
exit status: 256 Failed to synchronize with agent (it's probably
exited)
My question is: does the jar need to be placed somewhere other than inside the Docker container? That doesn't make sense to me, but if not, how can Spark correctly find the jar file?
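One likely explanation of the error above: with --deploy-mode cluster on Mesos, the fetcher on the agent that runs the driver tries to copy /application.jar, and that path only exists inside the submitting service's container, not on the agent. A common workaround is to publish the jar somewhere the agents can fetch it, e.g. HDFS or an HTTP server. A sketch, where hosts, ports, and the HDFS path are placeholders:

```shell
# Sketch: publish the fat jar where Mesos agents can fetch it, instead of
# baking it only into the submitting container. Hosts/paths are placeholders.
hdfs dfs -put target/application.jar /apps/application.jar

# Submit with a URL the Mesos fetcher can resolve on any agent:
/usr/local/spark/bin/spark-submit \
  --class com.spark.sample.MainClass \
  --master spark://<host>:<port> \
  --deploy-mode cluster \
  --executor-memory 20G \
  --total-executor-cores 100 \
  hdfs://<namenode>:<port>/apps/application.jar
```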

Related

Standard way to store/upload application jar on Spark cluster on Kubernetes

I have a Spark-based Kubernetes cluster where I use spark-submit to submit jobs to the cluster as needed, e.g.:
spark-submit \
--master spark://my-spark-master-svc:7077 \
--class com.Main \
examples/jars/my-spark-application.jar
Here I have uploaded the file my-spark-application.jar using kubectl cp to the directory examples/jars on the master Pod/container before running the spark-submit command.
Another option would be to mount a Volume on the cluster and share the jar via that volume.
What is the typical way to share the application jar with the spark cluster while using spark-submit on Kubernetes?
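The kubectl cp approach described above might look like the following; the pod name and the path inside the pod are assumptions for illustration:

```shell
# Copy the locally built jar into the Spark master pod before submitting.
# "my-spark-master-0" and the target directory are placeholders.
kubectl cp my-spark-application.jar \
  my-spark-master-0:/opt/spark/examples/jars/my-spark-application.jar

# Then run spark-submit from inside that pod:
kubectl exec -it my-spark-master-0 -- \
  /opt/spark/bin/spark-submit \
    --master spark://my-spark-master-svc:7077 \
    --class com.Main \
    /opt/spark/examples/jars/my-spark-application.jar
```

The drawback of both kubectl cp and a shared volume is the manual copy step; baking the jar into the application image avoids it at the cost of rebuilding the image per release.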

How can I run spark-submit commands using the GCP spark operator on kubernetes

I have a Spark application which I want to deploy on Kubernetes using the GCP Spark operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator).
I was able to run a Spark application using the command kubectl apply -f example.yaml, but I want to use spark-submit commands.
There are a few options mentioned at https://github.com/big-data-europe/docker-spark which you can use; see if one of them solves your problem:
kubectl run spark-base --rm -it --labels="app=spark-client" \
  --image bde2020/spark-base:2.4.5-hadoop2.7 -- \
  bash ./spark/bin/spark-shell --master spark://spark-master:7077 \
  --conf spark.driver.host=spark-client
or
kubectl run spark-base --rm -it --labels="app=spark-client" \
  --image bde2020/spark-base:2.4.5-hadoop2.7 -- \
  bash ./spark/bin/spark-submit --class CLASS_TO_RUN \
  --master spark://spark-master:7077 --deploy-mode client \
  --conf spark.driver.host=spark-client URL_TO_YOUR_APP
There is no way to directly manipulate the spark-submit command that the Spark operator generates when it translates the YAML configuration file into Spark-specific options and Kubernetes resources. That is kind of the point of using the operator: it lets you use a YAML config file to run either a SparkApplication or a ScheduledSparkApplication as if it were a Kubernetes resource. Most options can be set either with Hadoop or Spark config files in ConfigMaps, or as command-line arguments to the JVM in the driver and executor pods. I recommend the last approach in order to have more flexibility when fine-tuning Spark jobs.
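To make the YAML-driven workflow concrete, a minimal SparkApplication manifest can be applied in place of a spark-submit command; in this sketch the image name, main class, jar path, and resource sizes are all assumptions about your setup:

```shell
# Sketch: the operator equivalent of a spark-submit command is a
# SparkApplication manifest. Image, class, and jar path are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-spark-app
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: my-repo/my-spark-image:latest
  mainClass: com.Main
  mainApplicationFile: "local:///opt/spark/examples/jars/my-spark-application.jar"
  sparkVersion: "2.4.5"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: "512m"
EOF
```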

How to use HDFS HA in spark on k8s?

My environment is CDH 5.11 with HDFS HA mode. I submit applications using SparkLauncher from my Windows PC. When I write
setAppResource("hdfs://ip:port/insertInToSolr.jar")
it works, but when I write
setAppResource("hdfs://hdfsHA_nameservices/insertInToSolr.jar")
it does not work.
I have copied my Hadoop config into the Spark Docker image by modifying
$SPARK_HOME/kubernetes/dockerfiles/spark/entrypoint.sh
When I use docker run -it <IMAGE ID> /bin/bash to run a container, inside the container I can use spark-shell to read HDFS and Hive.
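One thing worth checking: an HA nameservice URI like hdfs://hdfsHA_nameservices/ resolves only through the client-side Hadoop configuration (core-site.xml and hdfs-site.xml with the dfs.nameservices and dfs.ha.namenodes.* entries), so that config must be visible wherever the SparkContext is created, not just inside the image. A sketch, with a placeholder config path:

```shell
# Sketch: the HA nameservice URI resolves only if the Hadoop client config is
# visible. HADOOP_CONF_DIR must point at a directory containing core-site.xml
# and hdfs-site.xml copied from the CDH cluster (the path is a placeholder).
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Quick check: list a path via the nameservice URI rather than ip:port.
hdfs dfs -ls hdfs://hdfsHA_nameservices/
```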

submit spark job from local to emr ssh setup

I am new to spark. I want to submit a spark job from local to a remote EMR cluster.
I am following the link here to set up all the prerequisites: https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
here is the command as below:
spark-submit --class mymain --deploy-mode client --master yarn myjar.jar
Issue: SparkSession creation never finishes, with no error reported. It seems to be an access issue.
From the AWS document, we know that given --master yarn, YARN uses the config files I copied from EMR (yarn-site.xml) to know where the master and slaves are.
As my EMR cluster is located in a VPC, which needs a special SSH config to access, how can I add this info to YARN so it can reach the remote cluster and submit the job?
I think the resolution proposed in the AWS link is more like: create your local Spark setup with all dependencies.
If you don't want to do a local Spark setup, I would suggest these easier options:
1. Livy: for this your EMR setup should have Livy installed. Check this, this, this and you should be able to infer from this.
2. EMR ssh: this requires you to have aws-cli installed locally, plus the cluster ID and the pem file used while creating the EMR cluster. Check this.
E.g. aws emr ssh --cluster-id j-3SD91U2E1L2QX --key-pair-file ~/.ssh/mykey.pem --command 'your-spark-submit-command' (this prints the command output on the console, though).
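For the Livy route, job submission becomes a REST call rather than a local spark-submit, so no local Spark install is needed; a sketch, where the master DNS name and the S3 jar path are assumptions and 8998 is Livy's default port:

```shell
# Sketch: submit a Spark batch job to a remote EMR cluster through Livy's
# REST API. Host and jar location are placeholders for your setup.
curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "file": "s3://my-bucket/myjar.jar",
        "className": "mymain"
      }' \
  http://<emr-master-dns>:8998/batches

# Poll the state of the batch (id 0 here) until it completes:
curl http://<emr-master-dns>:8998/batches/0/state
```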

Run spark-shell inside Docker container against remote YARN cluster

Maybe someone already has experience building a Docker image for Spark?
I want to build a Docker image with Spark inside, but configured against a remote YARN cluster.
I have already created an image with Spark 1.6.2 inside.
But when I run
spark-shell --master yarn --deploy-mode client --driver-memory 32G --executor-memory 32G --executor-cores 8
inside Docker, I get the following exception:
Diagnostics: java.io.FileNotFoundException: File file:/usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.2.0.jar does not exist
Any suggestions?
Do I need to upload the spark-assembly to HDFS and set spark.yarn.jar=hdfs://spark-assembly-1.6.2-hadoop2.2.0.jar?
Here is my Dockerfile
https://gist.github.com/ponkin/cac0a071e7fe75ca7c390b7388cf4f91
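The spark.yarn.jar idea can work, but the value needs to be a full HDFS URI including a directory path, not just the jar name; a sketch with placeholder HDFS paths:

```shell
# Sketch: upload the assembly to HDFS once so YARN containers localize it
# from there, rather than expecting the jar at a local path in the image.
hdfs dfs -mkdir -p /spark/lib
hdfs dfs -put /usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.2.0.jar /spark/lib/

# Then launch against the remote YARN cluster:
spark-shell --master yarn --deploy-mode client \
  --conf spark.yarn.jar=hdfs:///spark/lib/spark-assembly-1.6.2-hadoop2.2.0.jar
```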