Find the leader Node in a Spark Standalone Cluster with Zookeeper - apache-spark

Hi, I'm using a Spark Standalone cluster with ZooKeeper.
Before doing a spark-submit I need to find the leader node in the Spark cluster.
My question is how to find the leader node among all the Spark master nodes:
1> Can it be fetched from ZooKeeper?
2> Is there any API exposed by the Spark master to check that?

Firstly, in a Spark cluster there is no leader node. There is one alive Master, one or more standby Masters, and one or more slaves. Secondly, when you submit a task to Spark, you don't need to know which one is the active master. You can provide all the Spark master IPs and the cluster will take care of everything.
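For example, a minimal spark-submit sketch against a ZooKeeper-backed standalone cluster might look like this (the host names, port, class name and jar are placeholders, not your actual values):
# list every master in the URL; Spark will locate the active one and fail over if it changes
spark-submit \
  --master spark://master1:7077,master2:7077,master3:7077 \
  --class com.example.MyApp \
  my-app.jar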
However, if you still want to see this information, the easiest way is through the web UI, which is usually available on port 8080. You can check the web UI port by looking at the Spark process details (the example below happens to show a Worker, whose web UI defaults to 8081):
ps -ef | grep spark
stefan 12682 1 15 09:50 pts/1 00:00:04 /usr/lib/jvm/java-8-oracle/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://spark-ip:7077
By accessing this web UI at http://spark-ip:port, you will be able to see all the details about that master server. If you want to see this data in JSON format, append /json to the URL.
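For example, assuming default ports, a quick check of each master from the command line could look like this; the JSON returned by a standalone master includes a status field that reads ALIVE on the active master and STANDBY on the others (host names are placeholders):
# query each master's web UI for its JSON status
curl http://spark-master-1:8080/json
curl http://spark-master-2:8080/json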

Related

How to expose Spark Driver behind dockerized Apache Zeppelin?

I am currently building a custom docker container from a plain distribution with Apache Zeppelin + Spark 2.x inside.
My Spark jobs will run in a remote cluster and I am using yarn-client as master.
When I run a notebook and try to print sc.version, the program gets stuck. If I go to the remote resource manager, an application has been created and accepted but in the logs I can read:
INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable
My understanding of the situation is that the cluster is unable to talk to the driver in the container but I don't know how to solve this issue.
I am currently using the following configuration:
spark.driver.port set to PORT1 and option -p PORT1:PORT1 passed to the container
spark.driver.host set to 172.17.0.2 (ip of the container)
SPARK_LOCAL_IP set to 172.17.0.2 (ip of the container)
spark.ui.port set to PORT2 and option -p PORT2:PORT2 passed to the container
I have the feeling I should change the SPARK_LOCAL_IP to the host ip but if I do so, SparkUI is unable to start, blocking the process a step before.
Thanks in advance for any ideas / advice!
Good question! First of all, as you know, Apache Zeppelin runs interpreters in separate processes.
In your case, the Spark interpreter JVM process hosts a SparkContext and serves as the Spark driver for the yarn-client deploy mode. This process inside the container, according to the Apache Spark documentation, needs to be able to communicate back and forth to/from the YARN ApplicationMaster and all the Spark worker machines of the cluster.
This implies that you have to have a number of ports open and manually forwarded between the container and the host machine. Here is an example of a project at ZEPL doing a similar job, where it took us 7 ports to get the job done.
Another approach could be running Docker networking in host mode (though it apparently does not work on OS X, due to a recent bug), as sketched below.
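As an illustration only, the two options could look roughly like this (the image name is a placeholder, and PORT1/PORT2 are the same placeholders used in the question; port forwarding would need to cover every driver port, not just these two):
# option 1: share the host's network stack, so no per-port forwarding is needed
docker run --net=host -d my-zeppelin-spark-image
# option 2: keep bridged networking and forward each required port explicitly
docker run -d -p PORT1:PORT1 -p PORT2:PORT2 my-zeppelin-spark-image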

How to run Spark Sql on a 10 Node cluster

I am using Spark for the first time. I have set up Spark on Hadoop 2.7 on a cluster with 10 nodes. On my master node, the following processes are running:
hduser#hadoop-master-mp:~$ jps
20102 ResourceManager
19736 DataNode
20264 NodeManager
24762 Master
19551 NameNode
24911 Worker
25423 Jps
Now, I want to write Spark SQL to do a certain computation on a 1 GB file, which is already present in HDFS.
If I go into spark shell on my master node:
spark-shell
and write the following query, will it just run on my master, or will it use all 10 nodes as workers?
scala> sqlContext.sql("CREATE TABLE sample_07 (code string,description string,total_emp int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
If not, what do I have to do to make my Spark Sql use full cluster?
You need a cluster manager to manage the master and workers. You can go for either the Spark standalone, YARN, or Mesos cluster manager. I would suggest the Spark standalone cluster manager over YARN just to get things started.
To just start it up,
Download the Spark distribution (pre-compiled for Hadoop) on all the nodes and set the Hadoop classpath and other important configurations in spark-env.sh.
1) Start the master using ./sbin/start-master.sh
It will start a web interface on a port (default 8080). Open the Spark master web page and note the Spark master URI mentioned on the page.
2) Go to all nodes, including the machine where you started the master, and start a slave, passing it the master URI:
./sbin/start-slave.sh <spark-master-uri>
Check the master web page again. It should list all the workers on the page. If they are not listed, you need to find the error in the logs.
3) Please check the cores and memory that each machine has against what is shown on the master web page for each worker. If they do not match, you can play with the start-up options to allocate them.
Go for Spark 1.5.2 or later.
Please follow the details here.
As it's just a starting point, let me know if you face any errors and I can help you out.
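Once the cluster is up, the key step for the original question is to point the shell at the standalone master instead of running it locally; a sketch, with the master URI and resource values as placeholders:
# run the shell against the cluster so SQL queries are distributed across all workers
spark-shell --master spark://<master-host>:7077
# optionally cap the resources a single shell/application may take
spark-shell --master spark://<master-host>:7077 --executor-memory 2G --total-executor-cores 20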

How to connect JMXConsole remotely to Spark streaming application

I have a Spark streaming application running in yarn-cluster mode, reading from a Kafka topic.
I want to connect JMXConsole or the Java visualvm to these remote processes in a Cloudera distribution to gather some performance benchmarks.
How would I go about doing that?
The way I've done this is to set/add the following property (Also start Flight Recorder):
spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=0
If you have only one worker running on each box, you can set the port to be fixed. If you have multiple, then you need to go with port 0 and then use lsof to find which port got assigned.
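For example, assuming SSH access to the worker host, locating the assigned port and attaching could look like this (the host name, PID, and port are placeholders):
# find the executor JVM on the worker box
ssh worker-host "jps | grep CoarseGrainedExecutorBackend"
# list the ports that JVM is listening on
ssh worker-host "lsof -nP -p <executor-pid> | grep LISTEN"
# attach JConsole (or VisualVM) to that host:port
jconsole worker-host:<jmx-port>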

What is the difference between web UIs on 4040 and 8080?

There are two different web UIs (one is for standalone mode only). Can I use the web UI on port 4040 when I am launching Spark in standalone mode? (Example: with spark-class.cmd org.apache.spark.deploy.master.Master, the web UI on 8080 is working, but 4040 is not.) What is the main difference between these UIs?
Is it possible for me to launch Spark (without Hadoop, HDFS, YARN etc.), keep it up, and submit my jars (classes) to it? I want to watch job statistics after it finishes. I am trying something like this:
Server: Spark\bin>spark-class.cmd org.apache.spark.deploy.master.Master
Worker: Spark\bin>spark-class.cmd org.apache.spark.deploy.worker.Worker spark://169.254.8.45:7077 --cores 4 --memory 512M
Submit: Spark\bin>spark-submit.cmd --class demo.TreesSample --master spark://169.254.8.45:7077 file:///E:/spark-demo/target/demo.jar
It runs. It brings up a new web UI on port 4040 for this task. I don't see anything in the Master's UI on 8080.
Currently I'm using win7 x64, spark-1.5.2-bin-hadoop2.6. I can switch into linux if it matters.
You should be able to change the web UI port for standalone Master using spark.master.ui.port or SPARK_MASTER_WEBUI_PORT as described in Configuring Ports for Network Security / Standalone mode only.
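For instance, in the master's conf/spark-env.sh (the port value here is just an example):
# conf/spark-env.sh on the master host
export SPARK_MASTER_WEBUI_PORT=8888
# restart the master for the change to take effect
./sbin/stop-master.sh && ./sbin/start-master.sh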
Standalone Master's web UI is the management console of a cluster manager (that happens to be part of Apache Spark, but could've been a separate product, like Hadoop YARN or Apache Mesos). Having said that, it can often be confusing what the two web UIs have in common, and the answer is nothing.
The Spark driver's web UI is to show the progress of your computations (jobs, stages, storage for RDD persistence, broadcasts, accumulators) while standalone Master's web UI is to let you know the current state of your "operating environment" (aka the Spark Standalone cluster).
I leave the other part of your question about the History Server to @Sumit's answer.
Yes, you can launch Spark as a standalone server, without any Hadoop or HDFS. Also, as soon as you submit your job to the master, it will show your job in either the "Running Jobs" or "Jobs Completed" section.
You can also enable the History Server to preserve the job statistics and analyze them at a later time:
./sbin/start-history-server.sh
Refer here for more details on enabling the History Server.
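A minimal sketch of the configuration that usually goes with it (the log directory is a placeholder; running applications and the History Server must point at the same location):
# conf/spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-event-logs
spark.history.fs.logDirectory    hdfs:///spark-event-logs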

How can I verify that DSE Spark Shell is distributing across the cluster

Is it possible to verify from within the Spark shell whether the shell is connected to the cluster or is running just in local mode? I'm hoping to use that to investigate the following problem:
I've used DSE to setup a small 3 node Cassandra Analytics cluster. I can log onto any of the 3 servers and run dse spark and bring up the Spark shell. I have also verified that all 3 servers have the Spark master configured by running dsetool sparkmaster.
However, when I run any task using the Spark shell, it appears that it is only running locally. I ran a small test command:
val rdd = sc.cassandraTable("test", "test_table")
rdd.count
When I check the Spark Master webpage, I see that only one server is running the job.
I suspect that when I run dse spark it's running the shell in local mode. I looked up how to specify a master for the Spark 0.9.1 shell, and even when I use MASTER=<sparkmaster> dse spark (from the Programming Guide) it still runs only in local mode.
Here's a walkthrough once you've started a DSE 4.5.1 cluster with 3 nodes, all set for Analytics Spark mode.
Once the cluster is up and running, you can determine which node is the Spark Master with command dsetool sparkmaster. This command just prints the current master; it does not affect which node is the master and does not start/stop it.
Point a web browser to the Spark Master web UI at the given IP address and port 7080. You should see 3 workers in the ALIVE state, and no Running Applications. (You may have some DEAD workers or Completed Applications if previous Spark jobs had happened on this cluster.)
Now on one node bring up the Spark shell with dse spark. If you check the Spark Master web UI, you should see one Running Application named "Spark shell". It will probably show 1 core allocated (the default).
If you click on the application ID link ("app-2014...") you'll see the details for that app, including one executor (worker). Any commands you give the Spark shell will run on this worker.
The default configuration limits the Spark master to allowing each application to use only 1 core, therefore the work will only be given to a single node.
To change this, login to the Spark master node and sudo edit the file /etc/dse/spark/spark-env.sh. Find the line that sets SPARK_MASTER_OPTS and remove the portion -Dspark.deploy.defaultCores=1. Then restart DSE on this node (sudo service dse restart).
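Illustratively, the edit might look like this (the other options on that line in your file will differ):
# /etc/dse/spark/spark-env.sh -- before
export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -Dspark.deploy.defaultCores=1"
# after removing the 1-core cap
export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS"
# restart DSE on the master node
sudo service dse restart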
Once it comes up, check the Spark master web UI and repeat the test with the Spark shell. You should see that it's been allocated more cores, and any jobs it performs will happen on multiple nodes.
In a production environment you'd want to set the number of cores more carefully so that a single job doesn't take all the resources.
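As a sketch of one way to do that (the value is arbitrary), you can cap each application rather than relying on the master-wide default:
# conf/spark-defaults.conf (or set it when launching the application)
spark.cores.max    4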
