Memory, CPU, GPU profiling in a containerized Spark cluster - apache-spark

Any suggestion on which library/tool I should use to plot RAM, CPU, and (optionally) GPU usage over time for a Spark app submitted through spark-submit to a Docker-containerized Spark cluster?
In its documentation, Apache suggests using memory_profiler with commands like:
python -m memory_profiler profile_memory.py
but after accessing my master node through a remote shell:
docker exec -it spark-master bash
I can't launch my Spark apps locally, because I have to use the spark-submit command to submit them to the cluster.
Any suggestion? I launch the apps without YARN, in cluster mode, through
/opt/spark/spark-submit --master spark://spark-master:7077 appname.py
I would also like to know whether I can use memory_profiler even though I have to go through spark-submit.
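As for memory_profiler under spark-submit: importing its profile decorator directly in the driver script should still work, but it only covers the driver process, not the executors. Since the cluster is containerized, a simpler container-level sketch (not from the Spark docs; the output file names below are just placeholders) is to sample docker stats and, optionally, nvidia-smi on the host into CSV files and plot them afterwards:
# sample CPU/RAM of all running containers once per second (run in its own shell)
while true; do
  docker stats --no-stream --format "{{.Name}},{{.CPUPerc}},{{.MemUsage}}" | sed "s/^/$(date +%s),/" >> spark_usage.csv
  sleep 1
done
# optional GPU utilization/memory, one sample per second (run in another shell; needs nvidia-smi on the host)
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1 > gpu_usage.csv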

Related

How can I run spark-submit commands using the GCP spark operator on kubernetes

I have a Spark application which I want to deploy on Kubernetes using the GCP Spark operator: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.
I was able to run a Spark application using the command kubectl apply -f example.yaml, but I want to use spark-submit commands.
There are a few options mentioned at https://github.com/big-data-europe/docker-spark which you can use.
See if one of these solves your problem:
kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:2.4.5-hadoop2.7 -- bash ./spark/bin/spark-shell --master spark://spark-master:7077 --conf spark.driver.host=spark-client
or
kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:2.4.5-hadoop2.7 -- bash ./spark/bin/spark-submit --class CLASS_TO_RUN --master spark://spark-master:7077 --deploy-mode client --conf spark.driver.host=spark-client URL_TO_YOUR_APP
There is no way to directly manipulate the spark-submit command that the Spark operator generates when it translates the YAML configuration file into Spark-specific options and Kubernetes resources. That is kind of the point of using the operator: it lets you use a YAML config file to run either a SparkApplication or a ScheduledSparkApplication as if it were a Kubernetes resource. Most options can be set either with Hadoop or Spark config files in config maps, or as command-line arguments to the JVM in the driver and executor pods. I recommend the latter approach in order to have more flexibility when it comes to fine-tuning Spark jobs.
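For reference, the operator-side equivalent of a spark-submit invocation is a SparkApplication manifest along these lines. This is only a rough sketch based on the operator's v1beta2 examples; the image, jar path, versions, and option values are placeholders, so check the operator's API docs for the exact fields your version supports:
kubectl apply -f - <<'EOF'
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.1.1"   # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  sparkConf:                                    # arbitrary Spark options
    "spark.eventLog.enabled": "false"
  driver:
    cores: 1
    memory: "512m"
    javaOptions: "-XX:+UseG1GC"                 # extra JVM args for the driver pod
  executor:
    cores: 1
    instances: 2
    memory: "512m"
EOF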

How to use HDFS HA in spark on k8s?

My environment is CDH 5.11 with HDFS in HA mode. I submit the application using SparkLauncher from my Windows PC. When I write
setAppResource("hdfs://ip:port/insertInToSolr.jar")
it works, but when I write
setAppResource("hdfs://hdfsHA_nameservices/insertInToSolr.jar")
it does not work.
I have copied my Hadoop config into the Spark Docker image by modifying
$SPARK_HOME/kubernetes/dockerfiles/spark/entrypoint.sh
When I use docker run -it IMAGE_ID /bin/bash to run a container, inside the container I can use spark-shell to read HDFS and Hive.
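One thing worth checking (not from the thread, just a sketch): for the nameservice URI to resolve, the driver and executors need the HDFS HA client settings. If copying the full Hadoop config into the image is not enough, the same settings can be passed explicitly as spark.hadoop.* properties, either on spark-submit or via SparkLauncher.setConf; the namenode hosts below are hypothetical:
spark-submit \
  --conf spark.hadoop.dfs.nameservices=hdfsHA_nameservices \
  --conf spark.hadoop.dfs.ha.namenodes.hdfsHA_nameservices=nn1,nn2 \
  --conf spark.hadoop.dfs.namenode.rpc-address.hdfsHA_nameservices.nn1=namenode1:8020 \
  --conf spark.hadoop.dfs.namenode.rpc-address.hdfsHA_nameservices.nn2=namenode2:8020 \
  --conf spark.hadoop.dfs.client.failover.proxy.provider.hdfsHA_nameservices=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider \
  <your --master/--class and other options> \
  hdfs://hdfsHA_nameservices/insertInToSolr.jar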

Spark submit from application running in Mesos DCOS cluster

I have a Mesos DCOS cluster running on AWS with Spark installed via the dcos package install spark command. I am able to successfully execute Spark jobs using the DCOS CLI: dcos spark run ...
Now I would like to execute Spark jobs from a Docker container running inside the Mesos cluster, but I'm not quite sure how to reach the running instance of Spark. The idea would be to have a Docker container execute the spark-submit command to submit a job to the Spark deployment, instead of executing the same job from outside the cluster with the DCOS CLI.
Current documentation seems to be focused only on running Spark via the DCOS CLI - is there any way to reach the Spark deployment from another application running inside the cluster?
The DCOS IoT demo does something similar: https://github.com/amollenkopf/dcos-iot-demo
These guys run a Spark Docker image and spark-submit in a Marathon app. Check this Marathon descriptor: https://github.com/amollenkopf/dcos-iot-demo/blob/master/spatiotemporal-esri-analytics/rat01.json
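The general pattern there is a Marathon app whose cmd runs spark-submit from inside a Spark image. A rough sketch, not the exact descriptor from the demo; the image, jar URL, and Mesos-DNS names (leader.mesos, marathon.mesos) are placeholders to verify against your cluster:
cat > spark-client.json <<'EOF'
{
  "id": "spark-client",
  "cpus": 1,
  "mem": 2048,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "some-spark-image:tag" }
  },
  "cmd": "/opt/spark/bin/spark-submit --master mesos://leader.mesos:5050 --class MyJob http://some-host/my-job.jar"
}
EOF
# register the app with Marathon's REST API
curl -X POST -H "Content-Type: application/json" -d @spark-client.json http://marathon.mesos:8080/v2/apps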

Spark Mesos Cluster Mode using Dispatcher

I have only a single machine and want to run Spark jobs in Mesos cluster mode. It might make more sense to run with a cluster of nodes, but I mainly want to test out Mesos first to check whether it can utilize resources more efficiently (run multiple Spark jobs at the same time without static partitioning). I have tried a number of ways, without success. Here is what I did:
Build Mesos and run both the Mesos master and slaves (2 slaves on the same machine).
sudo ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5051 --work_dir=/tmp/mesos1
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5052 --work_dir=/tmp/mesos2
Run the spark-mesos-dispatcher
sudo ./sbin/start-mesos-dispatcher.sh --master mesos://localhost:5050
Then submit the app with the dispatcher as the master URL.
spark-submit --master mesos://localhost:7077 <other-config> <jar file>
But it doesn't work:
E0925 17:30:30.158846 807608320 socket.hpp:174] Shutdown failed on fd=61: Socket is not connected [57]
E0925 17:30:30.159545 807608320 socket.hpp:174] Shutdown failed on fd=62: Socket is not connected [57]
If I use spark-submit --deploy-mode cluster, then I get another error message:
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
It works perfectly if I don't use the dispatcher but use the Mesos master URL directly: --master mesos://localhost:5050 (client mode). According to the documentation, cluster mode is not supported for Mesos clusters, yet they give separate instructions for cluster mode here. So it's kind of confusing. My questions are:
How can I get it to work?
Should I use client mode instead of cluster mode if I submit the app/jar directly from the master node?
If I have a single computer, should I spawn one or more Mesos slave processes? Basically, I have a number of Spark jobs and don't want to do static partitioning of resources. But when using Mesos without static partitioning, it seems to be much slower?
Thanks.
There seem to be two things you're confusing: launching a Spark application in a cluster (as opposed to locally) and launching the driver into the cluster.
From the top of Submitting Applications:
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.
So, Mesos is one of the supported cluster managers and hence you can run Spark apps on a Mesos cluster.
What Mesos, at the time of writing, does not support is launching the driver into the cluster; this is what the command-line argument --deploy-mode of ./bin/spark-submit specifies. Since the default value of --deploy-mode is client, you can just omit it, or if you want to specify it explicitly, use:
./bin/spark-submit --deploy-mode client ...
I tried your scenario and it worked.
One thing I did differently: I used the IP address instead of "localhost" and "127.0.0.1".
So just try again, and check in a browser whether http://your_dispatcher:8081 exists.
This is my spark-submit command:
$spark-submit --deploy-mode cluster --master mesos://192.168.11.79:7077 --class "SimpleApp" SimpleAppV2.jar
If it succeeds, you will see something like the following:
{
  "action" : "CreateSubmissionResponse",
  "serverSparkVersion" : "1.5.0",
  "submissionId" : "driver-20151006164749-0001",
  "success" : true
}
When I got the same error log as yours, I rebooted the machine and retried your steps. It also worked.
Try using port 6066 instead of 7077. Newer versions of Spark prefer the REST API for submitting jobs.
See https://issues.apache.org/jira/browse/SPARK-5388
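Applied to the command from the answer above, only the port changes (this is just that command with 6066 substituted; whether your dispatcher exposes the REST port depends on the Spark version):
spark-submit --deploy-mode cluster --master mesos://192.168.11.79:6066 --class "SimpleApp" SimpleAppV2.jar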

Spark - How to run a standalone cluster locally

Is it possible to run a Spark standalone cluster locally on just one machine (which is basically different from just developing jobs locally, i.e. local[*])?
So far I have been running 2 different VMs to build a cluster; what if I could run a standalone cluster on the very same machine, with for instance three different JVMs running?
Could something like having multiple loopback addresses do the trick?
Yes, you can do it: launch one master and one worker node and you are good to go.
Launch the master:
./sbin/start-master.sh
Launch a worker:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -c 1 -m 512M
Run the SparkPi example:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://localhost:7077 lib/spark-examples-1.2.1-hadoop2.4.0.jar
Apache Spark Standalone Mode Documentation
A small update: as of the latest version (2.1.0), the default is to bind the master to the hostname, so when starting a worker locally, use the output of hostname:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://`hostname`:7077 -c 1 -m 512M
And to run an example, simply run the following command:
bin/run-example SparkPi
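If you would rather keep addressing the master as localhost, the standalone docs also describe a SPARK_MASTER_HOST variable that can be exported (or set in conf/spark-env.sh) before starting the master. A small sketch, with 127.0.0.1 as an example bind address:
export SPARK_MASTER_HOST=127.0.0.1
./sbin/start-master.sh
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077 -c 1 -m 512M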
If you can't find the ./sbin/start-master.sh file on your machine, you can also start the master with
./bin/spark-class org.apache.spark.deploy.master.Master
More simply,
./sbin/start-all.sh
This launches a master and one worker on your local machine.
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://localhost:7077 \
examples/jars/spark-examples_2.12-3.0.1.jar 10000
This submits a sample application. For monitoring via the web UI:
Master UI: http://localhost:8080
Worker UI: http://localhost:8081
Application UI: http://localhost:4040
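To get closer to the "three different JVMs" from the question instead of a single worker, the standalone docs also list SPARK_WORKER_INSTANCES (plus per-worker core/memory limits) among the conf/spark-env.sh settings. A sketch, with arbitrary limits:
# in conf/spark-env.sh, before running ./sbin/start-all.sh
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g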
