How to connect JMXConsole remotely to a Spark streaming application

I have a Spark streaming application running in yarn-cluster mode, reading from a Kafka topic.
I want to connect JMXConsole or the Java visualvm to these remote processes in a Cloudera distribution to gather some performance benchmarks.
How would I go about doing that?

The way I've done this is to set/add the following property (it also enables Java Flight Recorder):
spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=0
If you have only one worker running on each box, you can set the port to a fixed value. If you have multiple, then you need to go with port 0 and then use lsof to find which port got assigned.
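For reference, here is a minimal sketch of a spark-submit invocation carrying these options (the application jar name is a placeholder), followed by the lsof call you would run on an executor host once you know the executor's PID, to discover the randomly assigned JMX port:

spark-submit \
  --master yarn --deploy-mode cluster \
  --conf "spark.executor.extraJavaOptions=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=0" \
  my-streaming-app.jar

# on the executor host: list the TCP ports the executor JVM is listening on
lsof -a -p <executor-pid> -iTCP -sTCP:LISTEN -P -n

You can then point JConsole or VisualVM at executor-host:port.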

Related

Exporting Spark executor JMX metrics for multiple executors running on the same machine

I am trying to scrape metrics for the Spark driver and executors using a javaagent with the options below. I have Prometheus in a Kubernetes cluster and I am running this Spark application outside the Kubernetes cluster.
spark.executor.extraJavaOptions=-javaagent:/opt/clkd/prometheus/jmx_prometheus_javaagent-0.3.1.jar=53700:executor_pattern.yaml
but I got the exception below, since both executors run on the same machine:
Caused by: java.net.BindException: Address already in use ....
I see many have posted the same question, but I couldn't find an answer. Please let me know how I can resolve this issue.
I think that you need to switch from pull-based monitoring to push-based monitoring. For things such as Spark jobs it makes more sense, as they aren't running all the time. For that you have some alternatives:
Spark Prometheus Sink from Banzai Cloud as outlined in their blog post
Set up the GraphiteSink as described in the Spark documentation, point it at https://github.com/prometheus/graphite_exporter, and then scrape metrics from that exporter (a minimal sink configuration is sketched just below)
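If you go the GraphiteSink route, a minimal metrics.properties sketch could look like the following (the exporter hostname is an assumption; 9109 is graphite_exporter's default Graphite listen port). Pass it with --conf spark.metrics.conf=metrics.properties or drop it into $SPARK_HOME/conf/metrics.properties:

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite-exporter.example.com
*.sink.graphite.port=9109
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark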
Initial answer:
You can't have two processes listening on the same port, so just bind Prometheus from different jobs to different ports. The port is the number after jmx_prometheus_javaagent-0.3.1.jar= and before the : character; in your case it's 53700. So you can use one port for one task, and another port (maybe 53701) for the second task...
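For instance, the only thing that has to differ between the two submissions is the port in front of the : (the yaml file name is taken from the question):

job 1:  --conf "spark.executor.extraJavaOptions=-javaagent:/opt/clkd/prometheus/jmx_prometheus_javaagent-0.3.1.jar=53700:executor_pattern.yaml"
job 2:  --conf "spark.executor.extraJavaOptions=-javaagent:/opt/clkd/prometheus/jmx_prometheus_javaagent-0.3.1.jar=53701:executor_pattern.yaml"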

Exporting metrics from multiple Spark executors running on a single node using the Prometheus JMX agent

I am running a Spark cluster and I have one node which has three executors running on it. I want to scrape metrics for all three executors using the Prometheus JMX agent. I am passing the Prometheus java agent via "spark.executor.extraJavaOptions" in the spark-submit command like below.
--conf "spark.executor.extraJavaOptions=-javaagent:/opt/agent/jmx_prometheus_javaagent-0.3.1.jar=6677:/opt/agent/spark.yml"
I am passing port 6677, and JMX metrics are available for one executor only. For the other two executors the javaagent fails because port 6677 is already in use, so no metrics are reported for them. Can someone please guide me on how to solve this problem? I found a similar question here but there is no answer for it.
Use different ports for the other two executors? You can't have three servers listening on the same port, and this has nothing to do with Prometheus or JMX.

Find the leader Node in a Spark Standalone Cluster with Zookeeper

Hi, I'm using a Spark Standalone cluster with ZooKeeper.
Before doing spark-submit I need to find the leader node of the Spark cluster.
My question is how to find the leader node across all the spark-master nodes:
1> Can it be fetched from ZooKeeper?
2> Is there any API exposed by the spark-master to check that?
Firstly, in a Spark cluster there is no leader node. There is one alive Master, one or more standby Masters, and one or more Workers. Secondly, when you submit a task to Spark, you don't need to know which one is the active master. You can provide all the Spark masters' IPs and the cluster will take care of everything, as in the example below.
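For example, with standalone HA you can list every master in the --master URL (hostnames and the application class/jar are placeholders), and the submission will be routed to whichever master is currently alive:

spark-submit --master spark://master1:7077,master2:7077,master3:7077 --class com.example.MyApp my-app.jar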
However, if you still want to see this information, the easiest way is to access the web UI, which is usually available on port 8080. You can check the web UI port by looking at the Spark process details (the example below happens to show a Worker with --webui-port 8081; the Master's web UI defaults to 8080):
ps -ef | grep spark
stefan 12682 1 15 09:50 pts/1 00:00:04 /usr/lib/jvm/java-8-oracle/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://spark-ip:7077
By accessing this web UI at http://spark-ip:port, you will be able to see all the details about that master server. If you want to see this data in JSON format, add /json at the end.
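For example, assuming the default web UI port 8080 and that your masters expose the /json endpoint (on the Spark versions I've checked, the payload includes a "status" field), a quick loop over the master hostnames (placeholders here) shows which one is ALIVE:

for m in master1 master2 master3; do
  echo -n "$m: "; curl -s http://$m:8080/json | grep '"status"'
done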

How to expose Spark Driver behind dockerized Apache Zeppelin?

I am currently building a custom docker container from a plain distribution with Apache Zeppelin + Spark 2.x inside.
My Spark jobs will run in a remote cluster and I am using yarn-client as master.
When I run a notebook and try to print sc.version, the program gets stuck. If I go to the remote resource manager, an application has been created and accepted but in the logs I can read:
INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable
My understanding of the situation is that the cluster is unable to talk to the driver in the container but I don't know how to solve this issue.
I am currently using the following configuration:
spark.driver.port set to PORT1 and option -p PORT1:PORT1 passed to the container
spark.driver.host set to 172.17.0.2 (ip of the container)
SPARK_LOCAL_IP set to 172.17.0.2 (ip of the container)
spark.ui.port set to PORT2 and option -p PORT2:PORT2 passed to the container
I have the feeling I should change SPARK_LOCAL_IP to the host IP, but if I do so, the Spark UI is unable to start, blocking the process one step earlier.
Thanks in advance for any ideas / advice!
Good question! First of all, as you know, Apache Zeppelin runs interpreters in separate processes.
In your case, the Spark interpreter JVM process hosts a SparkContext and serves as the Spark driver for the yarn-client deployment mode. According to the Apache Spark documentation, this process inside the container needs to be able to communicate back and forth to/from the YARN ApplicationMaster and all the Spark worker machines of the cluster.
This implies that you have to have a number of ports open and manually forwarded between the container and the host machine. Here is an example of a project at ZEPL doing a similar job, where it took us 7 ports to get the job done.
Another approach can be running Docker networking in host mode (though it apparently does not work on OS X, due to a recent bug).
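As a rough sketch of the port-forwarding route (the concrete port numbers and image name are my own assumptions, and spark.driver.bindAddress requires Spark 2.1+), you would pin the driver-side ports in the interpreter's Spark properties and publish exactly those ports on the container:

spark.driver.port            38000
spark.blockManager.port      38001
spark.ui.port                4040
spark.driver.bindAddress     0.0.0.0             # bind inside the container (Spark 2.1+)
spark.driver.host            <host-machine-ip>   # what YARN and the executors connect back to

docker run -d -p 38000:38000 -p 38001:38001 -p 4040:4040 my-zeppelin-spark-image

The host-networking alternative is simply docker run --net=host ... (Linux only).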

What is the difference between web UIs on 4040 and 8080?

There are two different web UIs (one is for standalone mode only). Can I use the web UI on port 4040 when I am launching Spark in standalone mode? (For example: spark-class.cmd org.apache.spark.deploy.master.Master - web UI 8080 is working, 4040 is not.) What is the main difference between these UIs?
Is it possible for me to launch Spark (without Hadoop, HDFS, YARN etc.), keep it up, and submit my jars (classes) to it? I want to watch job statistics after it finishes. I am trying something like this:
Server: Spark\bin>spark-class.cmd org.apache.spark.deploy.master.Master
Worker: Spark\bin>spark-class.cmd org.apache.spark.deploy.worker.Worker spark://169.254.8.45:7077 --cores 4 --memory 512M
Submit: Spark\bin>spark-submit.cmd --class demo.TreesSample --master spark://169.254.8.45:7077 file:///E:/spark-demo/target/demo.jar
It runs. It brings up a new web UI on port 4040 for this task. I don't see anything in the Master's UI on 8080.
Currently I'm using Win7 x64, spark-1.5.2-bin-hadoop2.6. I can switch to Linux if it matters.
You should be able to change the web UI port for standalone Master using spark.master.ui.port or SPARK_MASTER_WEBUI_PORT as described in Configuring Ports for Network Security / Standalone mode only.
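For example (the port value is arbitrary), either of the following should work:

in conf/spark-env.sh:  export SPARK_MASTER_WEBUI_PORT=8082
or at start-up:        spark-class.cmd org.apache.spark.deploy.master.Master --webui-port 8082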
Standalone Master's web UI is the management console of a cluster manager (one that happens to be part of Apache Spark, but could have been a separate product, like Hadoop YARN or Apache Mesos). Having said that, it can often be confusing what the two web UIs have in common, and the answer is nothing.
The Spark driver's web UI is to show the progress of your computations (jobs, stages, storage for RDD persistence, broadcasts, accumulators) while standalone Master's web UI is to let you know the current state of your "operating environment" (aka the Spark Standalone cluster).
I leave the other part of your question, about the History Server, to @Sumit's answer.
Yes, you can launch Spark as a standalone server, without Hadoop or HDFS. Also, as soon as you submit your job to the master, it will show your job either in the "Running jobs" or the "Jobs Completed" section.
You can also enable the History Server to preserve job statistics and analyze them at a later time:
./sbin/start-history-server.sh
Refer here for more details on enabling the History Server.
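A minimal sketch of what that involves (the event log directory is a placeholder): enable event logging in conf/spark-defaults.conf, point the History Server at the same directory, then start it and browse to port 18080:

spark.eventLog.enabled           true
spark.eventLog.dir               file:///tmp/spark-events
spark.history.fs.logDirectory    file:///tmp/spark-events

./sbin/start-history-server.sh
# then open http://<host>:18080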
