I am using Spark for the first time. I have set up Spark on Hadoop 2.7 on a cluster with 10 nodes. On my master node, the following processes are running:
hduser#hadoop-master-mp:~$ jps
20102 ResourceManager
19736 DataNode
20264 NodeManager
24762 Master
19551 NameNode
24911 Worker
25423 Jps
Now I want to write Spark SQL to do a certain computation on a 1 GB file that is already present in HDFS.
If I go into the Spark shell on my master node:
spark-shell
and write the following query, will it just run on my master, or will it use all 10 nodes as workers?
scala> sqlContext.sql("CREATE TABLE sample_07 (code string,description string,total_emp int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
If not, what do I have to do to make my Spark SQL use the full cluster?
You need a cluster manager to manage the master and workers. You can go with the Spark standalone, YARN, or Mesos cluster manager. I would suggest the Spark standalone cluster manager rather than YARN just to get started.
To get it up and running:
Download the Spark distribution (pre-compiled for Hadoop) on all the nodes, and set the Hadoop classpath and other important configuration in spark-env.sh.
1) Start the master using ./sbin/start-master.sh
It will bring up a web interface on a port (8080 by default). Open the Spark master web page and note the Spark master URI shown on that page.
2) Go to all the nodes, including the machine where you started the master, and start a worker:
./sbin/start-slave.sh <spark-master-uri>
Check the master web page again. It should list all the workers. If they are not listed, you need to find the error in the worker logs.
3) Check the cores and memory each machine has against what the master web page shows for each worker. If they do not match, you can play with the start-slave.sh options to allocate them.
Go with Spark 1.5.2 or later.
Please follow the details in the Spark standalone mode documentation.
As this is just a starting point, let me know if you face any errors and I can help you out.
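Putting the three steps together, here is a minimal sketch of the commands, assuming the Spark distribution is unpacked at the same path (say /opt/spark) on every node and that the master web UI reports the URI spark://hadoop-master-mp:7077; adjust the path and URI to your setup:
# On the master node:
cd /opt/spark
./sbin/start-master.sh        # web UI comes up on port 8080 by default
# On every node (including the master), start a worker against the master URI
# shown on the web UI; the cores/memory flags are optional overrides:
./sbin/start-slave.sh spark://hadoop-master-mp:7077 --cores 4 --memory 4G
# Point the shell (or spark-submit) at the same master so jobs use the whole cluster:
./bin/spark-shell --master spark://hadoop-master-mp:7077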
I am running Spark on YARN. In the YARN UI at port 8088 I can see that there is a job clearly running, and when I click on ApplicationMaster on the right-hand side, it shows that the job is progressing in Spark.
When I go to port 18080 for the Spark master, however, I see that "Memory in use" is 0, "Cores in use" is 0, and "Applications: 0 Running".
How do I get the Spark master to acknowledge that I am running an application and using cores and memory? The job is obviously progressing, because I can see things being written to disk, but why is the Spark master not up to date on it?
Spark Master is a component of Spark standalone. YARN and Spark standalone are both cluster managers; you should use only one of them. If you submit an application to YARN, you will not see it on the Spark Master UI, because in that case the cluster resources are managed by YARN, not by the Spark Master.
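As a concrete illustration (a sketch with placeholder class and jar names, ports being the defaults), the UI you should watch depends entirely on the --master you submit with:
# Submitted to YARN: the application shows up in the YARN ResourceManager UI (port 8088)
# and, if the history server is running, in the Spark History Server (port 18080).
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
# Submitted to a standalone master: only then does it appear on the Spark Master UI (port 8080).
spark-submit --master spark://<standalone-master-host>:7077 --class com.example.MyApp my-app.jar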
I noticed that when I start a job with spark-submit on YARN, the driver and executor nodes get chosen randomly. Is it possible to set this manually, so that when I collect the data and write it to a file, it is written on the same node every time?
As of right now, the parameters I have tried playing around with are:
spark.yarn.am.port <driver-ip-address>
and
spark.driver.hostname <driver-ip-address>
Thanks!
If you submit to YARN with --master yarn --deploy-mode client, the driver will be located on the node you are submitting from.
You can also restrict which nodes the executors are scheduled on using the property spark.yarn.executor.nodeLabelExpression:
A YARN node label expression that restricts the set of nodes executors will be scheduled on. Only versions of YARN greater than or equal to 2.6 support node label expressions, so when running against earlier versions, this property will be ignored.
Docs - Running Spark on YARN - Latest Documentation
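For example, combining client mode with a node label expression could look like the sketch below (the label gpu_nodes, the class, and the jar are hypothetical placeholders):
# Driver stays on the submitting host (client mode); executors are restricted to
# YARN nodes carrying the label "gpu_nodes" (requires YARN 2.6 or later):
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.yarn.executor.nodeLabelExpression=gpu_nodes \
  --class com.example.MyApp \
  my-app.jar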
A Spark application on YARN can run in either yarn-cluster or yarn-client mode.
In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client machine can go away after initiating the application.
In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
So, as you can see, where Spark places the application master (and hence the driver, in cluster mode) depends on the deploy mode; nothing is random up to this stage. However, the worker nodes on which the application master asks the resource manager to run tasks are picked based on the availability of the worker nodes.
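To make the two modes concrete, here is a sketch of the two submit commands, using SparkPi only as a placeholder application (the jar path assumes a Spark 2.x layout):
# yarn-cluster: the driver runs inside the YARN application master, somewhere on the cluster
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_*.jar 100
# yarn-client: the driver runs on the machine you submit from; only the AM runs on the cluster
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_*.jar 100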
Cluster specifications: Apache Spark on top of Mesos with 5 VMs and HDFS as storage.
spark-env.sh
export SPARK_LOCAL_IP=192.168.xx.xxx #to set the IP address Spark binds to on this node
export MESOS_NATIVE_JAVA_LIBRARY="/home/xyz/tools/mesos-1.0.0/build/src/.libs/libmesos-1.0.0.so" #to point to your libmesos.so if you use Mesos
export SPARK_EXECUTOR_URI="hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz"
HADOOP_CONF_DIR="/usr/local/tools/hadoop" #To point Spark towards Hadoop configuration files
spark-defaults.conf
spark.executor.uri hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz
spark.driver.host 192.168.xx.xxx
spark.rpc netty
spark.rpc.numRetries 5
spark.ui.port 48888
spark.driver.port 48889
spark.port.maxRetries 32
I did some experiments submitting a word-count Scala application in cluster mode. I observed that it executes successfully only when the driver program (containing the main method) ends up on the VM it was submitted from. As far as I know, scheduling of resources (VMs) is handled by Mesos. For example, if I submit my application from vm12 and Mesos coincidentally also schedules vm12 to run the driver, then it executes successfully. In contrast, it fails if the Mesos scheduler decides to allocate, say, vm15. I checked the stderr logs in the Mesos UI and found this error:
16/09/27 11:15:49 ERROR SparkContext: Error initializing SparkContext.
Besides, I looked into the configuration aspects of Spark at http://spark.apache.org/docs/latest/configuration.html and tried setting the RPC properties, since it seemed necessary to keep the driver program close to the worker nodes on the LAN, but I couldn't get much insight.
I also tried uploading my code (application) to HDFS and submitting the application jar file from HDFS. I made the same observations.
I connected Apache Spark with Mesos according to the documentation at http://spark.apache.org/docs/latest/running-on-mesos.html
I also tried configuring spark-defaults.conf and spark-env.sh on the other VMs to check whether it runs successfully from at least 2 VMs. That didn't work out either.
Am I missing something conceptually here?
How can I make my application run successfully regardless of which VM I submit it from?
I am a complete novice with Spark and have just started exploring it. I have chosen the longer path of not installing Hadoop from any CDH distribution; instead I installed Hadoop from the Apache website and set up the config files myself to understand more of the basics.
I have set up a 3-node cluster (all nodes are VMs created on an ESX server).
I have set up high availability for both the NameNode and the ResourceManager using the ZooKeeper mechanism. All three nodes are also used as DataNodes.
The following daemons are running across the three nodes:
Daemons on NameNode 1:
8724 QuorumPeerMain
13652 Jps
9045 DFSZKFailoverController
9175 DataNode
9447 NodeManager
8922 NameNode
8811 JournalNode
9324 ResourceManager
Daemons on NameNode 2:
22896 QuorumPeerMain
23780 ResourceManager
23220 DataNode
23141 NameNode
27034 Jps
23595 NodeManager
22955 JournalNode
23055 DFSZKFailoverController
Daemons on the DataNode:
7379 DataNode
7299 JournalNode
7556 NodeManager
7246 QuorumPeerMain
9705 Jps
I have set up HA for the NameNode and ResourceManager on NameNodes 1 and 2.
The nodes have a very minimal hardware configuration (4 GB RAM each and 20 GB disk space), but they are just for testing purposes, so I guess that's OK.
I have installed Spark (a version compatible with my installed Hadoop 2.7) on NameNode 1. I am able to start the Spark shell locally, run basic Scala commands to create an RDD, and perform some actions on it. I also managed to test-run the SparkPi example in both yarn-cluster and yarn-client deploy modes. All works well and good.
Now my problem is this: in a real-world scenario, we are going to write (Java, Scala, or Python) code on our local machines, not on the nodes that form the Hadoop cluster. Let's say I have another machine on the same network as my HA cluster. How do I submit my job (let's say I want to try submitting the SparkPi example) from a host that is not part of the HA cluster to the YARN ResourceManager?
I believe Spark has to be installed on the machine from which I am writing my code (is my assumption correct?) and no Spark needs to be installed in the HA cluster. I also want to get the output of the submitted job back on the host from which it was submitted. I have no idea what needs to be done to make this work.
I have heard of Spark JobServer; is this what I need to get all this up and running? I believe you guys can help me out with this confusion. I just could not find any document that clearly specifies the steps to follow to get this done. Can I submit a job from a Windows-based machine to my HA cluster set up in a Unix environment?
Spark JobServer provides a REST interface for your requirement. Apart from that, it offers other features as well.
See https://github.com/spark-jobserver/spark-jobserver for more information.
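For a flavour of that REST interface, here is a sketch based on the examples in the spark-jobserver README (the host, port, jar, and class are the project's samples, not something from your cluster):
# Upload an application jar under the name "test"
curl --data-binary @job-server-tests.jar localhost:8090/jars/test
# Run a job from that jar; the result (or a job id, for async jobs) comes back as JSON
curl -d "input.string = a b c a b see" \
  'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'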
In order to submit Spark jobs to the cluster, your machine has to become a "gateway node". That basically means you have the Hadoop binaries/libraries/configs installed on that machine, but no Hadoop daemons are running on it.
Once you have it set up, you should be able to run HDFS commands against your cluster from that machine (like hdfs dfs -ls /) and submit YARN applications to it (yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar pi 3 100).
After that step you can install Spark on your gateway machine and start submitting Spark jobs. If you are going to use Spark on YARN, this is the only machine Spark needs to be installed on.
You (your code) are responsible for getting the output of the job. You could choose to save the result in HDFS (the most common choice), print it to the console, etc. Spark's History Server is for debugging purposes.
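As a sketch of what a submission from the gateway machine could look like once the Hadoop configs and Spark are in place (SparkPi as the example; the HADOOP_CONF_DIR value and the examples jar path are assumptions based on a typical Spark 2.x layout):
# HADOOP_CONF_DIR must point at the copies of your cluster's configuration files
export HADOOP_CONF_DIR=/etc/hadoop/conf
# Client mode: the driver runs on the gateway machine, so anything the job prints
# (like SparkPi's "Pi is roughly ...") comes back to this console.
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100
# In cluster mode the driver runs inside YARN, so you would typically have the job
# write its results to HDFS and read them from the gateway afterwards.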
Is it possible to verify from within the Spark shell which nodes it is using, that is, whether the shell is connected to the cluster or just running in local mode? I'm hoping to use that to investigate the following problem:
I've used DSE to set up a small 3-node Cassandra Analytics cluster. I can log onto any of the 3 servers, run dse spark, and bring up the Spark shell. I have also verified that all 3 servers have the Spark master configured by running dsetool sparkmaster.
However, when I run any task using the Spark shell, it appears that it is only running locally. I ran a small test command:
val rdd = sc.cassandraTable("test", "test_table")
rdd.count
When I check the Spark Master web page, I see that only one server is running the job.
I suspect that when I run dse spark, it's running the shell in local mode. I looked up how to specify a master for the Spark 0.9.1 shell, and even when I use MASTER=<sparkmaster> dse spark (from the Programming Guide), it still runs only in local mode.
Here's a walkthrough once you've started a DSE 4.5.1 cluster with 3 nodes, all set for Analytics Spark mode.
Once the cluster is up and running, you can determine which node is the Spark Master with command dsetool sparkmaster. This command just prints the current master; it does not affect which node is the master and does not start/stop it.
Point a web browser to the Spark Master web UI at the given IP address and port 7080. You should see 3 workers in the ALIVE state, and no Running Applications. (You may have some DEAD workers or Completed Applications if previous Spark jobs had happened on this cluster.)
Now on one node bring up the Spark shell with dse spark. If you check the Spark Master web UI, you should see one Running Application named "Spark shell". It will probably show 1 core allocated (the default).
If you click on the application ID link ("app-2014...") you'll see the details for that app, including one executor (worker). Any commands you give the Spark shell will run on this worker.
The default configuration limits the Spark master to allowing each application only one core, so the work will only be given to a single node.
To change this, login to the Spark master node and sudo edit the file /etc/dse/spark/spark-env.sh. Find the line that sets SPARK_MASTER_OPTS and remove the portion -Dspark.deploy.defaultCores=1. Then restart DSE on this node (sudo service dse restart).
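A sketch of that edit (the path and restart command are the ones from this answer; the exact contents of SPARK_MASTER_OPTS on your node may differ):
sudo vi /etc/dse/spark/spark-env.sh
#   before: export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -Dspark.deploy.defaultCores=1"
#   after:  export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS"
sudo service dse restart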
Once it comes up, check the Spark master web UI and repeat the test with the Spark shell. You should see that it's been allocated more cores, and any jobs it performs will happen on multiple nodes.
In a production environment you'd want to set the number of cores more carefully so that a single job doesn't take all the resources.