How to use HDFS HA in Spark on K8s? - apache-spark

My environment is CDH 5.11 with HDFS in HA mode. I submit the application with SparkLauncher from my Windows PC. When I write
setAppResource("hdfs://ip:port/insertInToSolr.jar")
it works, but when I write
setAppResource("hdfs://hdfsHA_nameservices/insertInToSolr.jar")
it does not work.
I have copied my Hadoop config into the Spark Docker image by modifying
$SPARK_HOME/kubernetes/dockerfiles/spark/entrypoint.sh
When I use docker run -it <IMAGE ID> /bin/bash
to start a container, inside the container I can use spark-shell to read HDFS and Hive.
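One way to make the nameservice resolvable, sketched here under the assumption that the HA client settings are not otherwise visible to the driver and executor pods, is to pass them explicitly as spark.hadoop.* properties; the namenode hosts and the application class below are placeholders, and the same keys can equally be set through SparkLauncher.setConf:

# Sketch: pass the HDFS HA client settings explicitly (hosts and class are placeholders).
spark-submit \
  --master k8s://https://<k8s-apiserver>:<port> \
  --deploy-mode cluster \
  --conf spark.hadoop.dfs.nameservices=hdfsHA_nameservices \
  --conf spark.hadoop.dfs.ha.namenodes.hdfsHA_nameservices=nn1,nn2 \
  --conf spark.hadoop.dfs.namenode.rpc-address.hdfsHA_nameservices.nn1=<namenode1-host>:8020 \
  --conf spark.hadoop.dfs.namenode.rpc-address.hdfsHA_nameservices.nn2=<namenode2-host>:8020 \
  --conf spark.hadoop.dfs.client.failover.proxy.provider.hdfsHA_nameservices=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider \
  --class <your.main.Class> \
  hdfs://hdfsHA_nameservices/insertInToSolr.jar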

Related

Memory, CPU, GPU profiling in containerized Spark cluster

Any suggestion on which library/tool I should use for plotting RAM, CPU, and (optionally) GPU usage over time of a Spark app submitted to a Docker-containerized Spark cluster through spark-submit?
In the documentation Apache suggests using memory_profiler with commands like:
python -m memory_profiler profile_memory.py
but after accessing my master node through a remote shell:
docker exec -it spark-master bash
I can't launch my Spark apps locally because I need to use the spark-submit command to submit them to the cluster.
Any suggestion? I launch the apps without YARN but in cluster mode through
/opt/spark/spark-submit --master spark://spark-master:7077 appname.py
I would also like to know whether I can use memory_profiler even though I need to use spark-submit.
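If the goal is simply to plot RAM and CPU per container over time, rather than per Python function, one option (an assumption about the setup, not something taken from the Spark docs) is to sample docker stats on the host while the job runs and plot the resulting CSV afterwards; the container names below are placeholders:

# Sketch: sample container CPU/memory every 5 seconds into a CSV while the app runs
# (container names are placeholders for whatever docker-compose created).
while true; do
  docker stats --no-stream --format "{{.Name}},{{.CPUPerc}},{{.MemUsage}}" \
    spark-master spark-worker-1 >> usage.csv
  sleep 5
done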

Connecting Zeppelin installed on a local machine to a Docker Spark cluster

I am trying to configure the Spark interpreter of Zeppelin version 0.10.0, installed on a local machine, so that I can run scripts on a Spark cluster that is also created locally on Docker. I am using the docker-compose.yml from https://github.com/big-data-europe/docker-spark and Spark version 3.1.2. After docker-compose up, I can see in the browser the Spark master on localhost:8080 and the History Server on localhost:18081. After reading the ID of the spark-master container, I can also run a shell and spark-shell on it (docker exec -it xxxxxxxxxxxx /bin/bash). As host OS I am using Ubuntu 20.04; spark.master in Zeppelin is currently set to spark://localhost:7077, and zeppelin.server.port in zeppelin-site.xml to 8070.
There is a lot of information about connecting a container running Zeppelin, or running both Spark and Zeppelin in the same container, but unfortunately I also use that Zeppelin to connect to Hive via JDBC on a VirtualBox Hortonworks cluster, as in one of my previous posts, and I wouldn't want to change that configuration now due to hardware resources. In one of the posts (Running zeppelin on spark cluster mode) I saw that such a connection is possible; unfortunately, all attempts end with the "Fail to open SparkInterpreter" message.
I would be grateful for any tips.
You need to change spark.master in Zeppelin to point to the Spark master inside the Docker container, not the local machine, so spark://localhost:7077 won't work.
The port 7077 is fine because that is the port specified in the docker-compose file you are using. To get the IP address of the Docker container you can follow this answer. Since I suppose your container is named spark-master, you can try the following:
docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' spark-master
Then specify this as the spark.master in Zeppelin: spark://docker-ip:7077
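As a quick sanity check before changing the interpreter setting, one can verify from the host that the master's RPC port is reachable at that address (the IP below is a placeholder for whatever docker inspect printed):

# Replace <docker-ip> with the address printed by docker inspect above.
nc -zv <docker-ip> 7077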

Submit Spark job from local to EMR (SSH setup)

I am new to Spark. I want to submit a Spark job from my local machine to a remote EMR cluster.
I am following the link here to set up all the prerequisites: https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
Here is the command:
spark-submit --class mymain --deploy-mode client --master yarn myjar.jar
Issue: SparkSession creation never finishes, and there is no error. It seems to be an access issue.
From the AWS document, we know that when the master is set to yarn, YARN uses the config files I copied from EMR (yarn-site.xml) to find the master and slaves.
As my EMR cluster is located in a VPC, which needs a special SSH config to access, how can I add this info to YARN so it can reach the remote cluster and submit the job?
I think the resolution proposed in the AWS link amounts to creating a local Spark setup with all dependencies.
If you don't want to do a local Spark setup, I would suggest two easier options:
1. Livy: for this your EMR setup should have Livy installed. Check this, this, this and you should be able to infer from this (a submission sketch follows this list).
2. EMR SSH: this requires you to have the aws-cli installed locally, the cluster ID, and the pem file used while creating the EMR cluster. Check this
E.g. aws emr ssh --cluster-id j-3SD91U2E1L2QX --key-pair-file ~/.ssh/mykey.pem --command 'your-spark-submit-command' (this prints the command output on the console, though)
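For the Livy route, submission boils down to a REST call against the Livy server on the EMR master node (port 8998 by default); the endpoint, jar location, and class name below are placeholders:

# Sketch: submit a batch through Livy's REST API (host, jar path and class are placeholders).
curl -s -X POST http://<emr-master-dns>:8998/batches \
  -H 'Content-Type: application/json' \
  -d '{"file": "s3://<my-bucket>/myjar.jar", "className": "mymain"}'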

Submit a Docker container which contains a fat jar to a Spark cluster

I want to submit a Docker container which contains a 'fat jar' to a Spark cluster running on DC/OS. Here's what I have done.
mvn clean install, so the jar resides at /target/application.jar
docker build -t <repo/image> . && docker push <repo/image>
Now my DC/OS is able to pull the image from my private repository.
My Dockerfile looks like this:
FROM docker-release.com/spark:0.1.1-2.1.0-2.8.0 # I extended from this image to get all necessary components
ADD target/application.jar /application.jar # just put fat jar under root dir of Docker image
COPY bootstrap.sh /etc/bootstrap.sh
ENTRYPOINT ["/etc/bootstrap.sh"]
Here's what bootstrap.sh looks like:
#!/bin/bash -e
/usr/local/spark/bin/spark-submit --class com.spark.sample.MainClass --master spark://<host>:<port> --deploy-mode cluster --executor-memory 20G --total-executor-cores 100 /application.jar
I deployed this image as a service to DC/OS, where the Spark cluster also runs, and the service successfully submits to the Spark cluster. However, the Spark cluster is not able to locate the jar because it sits inside the service's Docker container.
I0621 06:06:25.985144 8760 fetcher.cpp:167] Copying resource with
command:cp '/application.jar'
'/var/lib/mesos/slave/slaves/e8a89a81-1da6-46a2-8caa-40a37a3f7016-S4/frameworks/e8a89a81-1da6-46a2-8caa-40a37a3f7016-0003/executors/driver-20170621060625-18190/runs/c8e710a6-14e3-4da5-902d-e554a0941d27/application.jar'
cp: cannot stat '/application.jar': No such file or directory
Failed to fetch '/application.jar':
Failed to copy with command 'cp '/application.jar'
'/var/lib/mesos/slave/slaves/e8a89a81-1da6-46a2-8caa-40a37a3f7016-S4/frameworks/e8a89a81-1da6-46a2-8caa-40a37a3f7016-0003/executors/driver-20170621060625-18190/runs/c8e710a6-14e3-4da5-902d-e554a0941d27/application.jar'',
exit status: 256 Failed to synchronize with agent (it's probably
exited)
My question is:
Does the jar need to be placed somewhere other than inside the Docker container? It doesn't make any sense to me, but if not, how can Spark correctly find the jar file?
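One direction that is sometimes suggested for this situation, offered here only as a sketch and not verified against this exact DC/OS setup, is to tell Spark that the jar is already local to the image and to run the driver and executors in that image, so the Mesos fetcher does not try to copy /application.jar from the agent host; the dispatcher address and image name are placeholders:

# Sketch: reference the jar with a local: URI and run in the image that contains it
# (the dispatcher address and image name are placeholders).
/usr/local/spark/bin/spark-submit \
  --class com.spark.sample.MainClass \
  --master mesos://<dispatcher-host>:<port> \
  --deploy-mode cluster \
  --conf spark.mesos.executor.docker.image=<repo/image> \
  --executor-memory 20G \
  --total-executor-cores 100 \
  local:///application.jar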

Spark submit from application running in Mesos DCOS cluster

I have a Mesos DCOS cluster running on AWS with Spark installed via the dcos package install spark command. I am able to successfully execute Spark jobs using the DCOS CLI: dcos spark run ...
Now I would like to execute Spark jobs from a Docker container running inside the Mesos cluster, but I'm not quite sure how to reach the running instance of Spark. The idea would be to have a Docker container execute the spark-submit command to submit a job to the Spark deployment instead of executing the same job from outside the cluster with the DCOS CLI.
Current documentation seems to be focused only on running Spark via the DCOS CLI - is there any way to reach the spark deployment from another application running inside the cluster?
The DC/OS IoT demo tries something similar: https://github.com/amollenkopf/dcos-iot-demo
These guys run a Spark Docker container and spark-submit in a Marathon app. Check this Marathon descriptor: https://github.com/amollenkopf/dcos-iot-demo/blob/master/spatiotemporal-esri-analytics/rat01.json
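In that pattern the container itself just runs spark-submit against the Spark dispatcher that the DC/OS Spark package exposes; a rough sketch, where the dispatcher address and the application artifact are placeholders that depend on the installation:

# Sketch: spark-submit from inside the cluster against the dispatcher
# (dispatcher host/port and jar URL are placeholders).
/opt/spark/bin/spark-submit \
  --deploy-mode cluster \
  --master mesos://<spark-dispatcher-host>:<port> \
  --class <your.main.Class> \
  http://<artifact-host>/application.jar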
