How can I verify that DSE Spark Shell is distributing across the cluster

Is it possible to verify from within the Spark shell whether it is connected to the cluster or is just running in local mode, and which nodes it is using? I'm hoping to use that to investigate the following problem:
I've used DSE to set up a small 3-node Cassandra Analytics cluster. I can log onto any of the 3 servers, run dse spark, and bring up the Spark shell. I have also verified that all 3 servers have the Spark master configured by running dsetool sparkmaster.
However, when I run any task using the Spark shell, it appears that it is only running locally. I ran a small test command:
val rdd = sc.cassandraTable("test", "test_table")
rdd.count
When I check the Spark Master webpage, I see that only one server is running the job.
I suspect that when I run dse spark it's running the shell in local mode. I looked up how to specify a master for the Spark 0.9.1 shell, and even when I use MASTER=<sparkmaster> dse spark (from the Programming Guide) it still runs only in local mode.

Here's a walkthrough once you've started a DSE 4.5.1 cluster with 3 nodes, all set for Analytics Spark mode.
Once the cluster is up and running, you can determine which node is the Spark Master with the command dsetool sparkmaster. This command just prints the current master; it does not affect which node is the master and does not start or stop it.
Point a web browser to the Spark Master web UI at the given IP address and port 7080. You should see 3 workers in the ALIVE state and no Running Applications. (You may see some DEAD workers or Completed Applications if previous Spark jobs have run on this cluster.)
Now on one node bring up the Spark shell with dse spark. If you check the Spark Master web UI, you should see one Running Application named "Spark shell". It will probably show 1 core allocated (the default).
If you click on the application ID link ("app-2014...") you'll see the details for that app, including one executor (worker). Any commands you give the Spark shell will run on this worker.
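To check from inside the shell itself, which is what the original question asks, you can also inspect the SparkContext directly. The output below is only illustrative, and getExecutorMemoryStatus may not be available on very old Spark releases:
scala> sc.master
// e.g. spark://10.10.1.1:7077 when attached to the cluster, or local[*] / local in local mode
scala> sc.getExecutorMemoryStatus.keys.foreach(println)
// prints one host:port entry per executor the shell can currently use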
The default configuration limits the Spark master to allocating only 1 core to each application, so the work will be given to only a single node.
To change this, log in to the Spark master node and sudo edit the file /etc/dse/spark/spark-env.sh. Find the line that sets SPARK_MASTER_OPTS and remove the portion -Dspark.deploy.defaultCores=1. Then restart DSE on this node (sudo service dse restart).
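For reference, the edit looks roughly like this (the exact contents of the line in your DSE release may differ); you can either drop the setting entirely or raise the default:
# before (roughly): every application gets only 1 core by default
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=1"
# after: remove the -Dspark.deploy.defaultCores portion, or pick a higher default, e.g.
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"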
Once it comes up, check the Spark master web UI and repeat the test with the Spark shell. You should see that it's been allocated more cores, and any jobs it performs will happen on multiple nodes.
In a production environment you'd want to set the number of cores more carefully so that a single job doesn't take all the resources.

Related

How are spark jobs submitted in cluster mode?

I know there is information worth 10 Google pages on this, but all of them tell me to just put --master yarn in the spark-submit command. But in cluster mode, how can my local laptop even know what that means? Let us say I have my laptop and a running Dataproc cluster. How can I use spark-submit from my laptop to submit a job to this cluster?
Most of the documentation on running a Spark application in cluster mode assumes that you are already on the same cluster where YARN/Hadoop are configured (e.g. you are ssh'ed in), in which case most of the time Spark will pick up the appropriate local configs and "just work".
This is the same for Dataproc: if you SSH onto the Dataproc master node, you can just run spark-submit --master yarn. More detailed instructions can be found in the documentation.
If you are trying to run applications locally on your laptop, this is more difficult. You will need to set up an ssh tunnel to the cluster, and then locally create configuration files that tell Spark how to reach the master via the tunnel.
Alternatively, you can use the Dataproc jobs API to submit jobs to the cluster without having to directly connect. The one caveat is that you will have to use properties to tell Spark to run in cluster mode instead of client mode (--properties spark.submit.deployMode=cluster). Note that when submitting jobs via the Dataproc API, the difference between client and cluster mode is much less pressing because in either case the Spark driver will actually run on the cluster (on the master or a worker respectively), not on your local laptop.
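For example, submitting from your laptop with the Cloud SDK might look like the sketch below. The cluster name, region, class, and jar path are placeholders, and the exact flag set can vary between gcloud versions:
gcloud dataproc jobs submit spark \
--cluster=my-cluster \
--region=us-central1 \
--class=com.example.MyApp \
--jars=gs://my-bucket/myapp.jar \
--properties=spark.submit.deployMode=cluster \
-- arg1 arg2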

Can You Use a Script to Start Spark Cluster Nodes?

I'm running Hadoop and Spark on a four-node cluster in AWS EC2.
After doing a lot of web research, it seems the accepted way to start Spark on a cluster (once Hadoop is running) is to:
1) Log into the master node and run start-master.sh.
2) Log into each slave node and run start-slave.sh, passing it the DNS and port information for the master node.
My question is: if there are, let's say, 20 nodes, this is pretty tedious and time consuming. Is there a way to start Spark from one centralized location, the way Hadoop is started? When you run Hadoop from the master node, it starts all the slave nodes remotely. I'm looking for a solution like that, or for a Python script that can SSH into the nodes and start them.
You could use Apache Ambari to manage the whole cluster, which would SSH to all nodes for you.
Otherwise, you could use a system like Ansible to configure and start all the services.
Sounds like you're only using Spark Standalone, though, not YARN, because there is no start-slaves script for YARN.
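If you would rather not adopt a full management tool, Spark's own standalone launch scripts can do the SSH fan-out for you, assuming passwordless SSH from the master and the same install path on every node (in newer Spark releases the file is conf/workers and the script is start-workers.sh). A minimal sketch, with placeholder hostnames:
# on the master: list every worker hostname, one per line
printf "worker1\nworker2\nworker3\n" > $SPARK_HOME/conf/slaves
# start the master and SSH out to start every listed worker in one go
$SPARK_HOME/sbin/start-all.sh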

Understanding Spark Submit Yarn Client vs Cluster mode [duplicate]

TL;DR: In a Spark Standalone cluster, what are the differences between client and cluster deploy modes? How do I set which mode my application is going to run on?
We have a Spark Standalone cluster with three machines, all of them with Spark 1.6.1:
A master machine, which also is where our application is run using spark-submit
2 identical worker machines
From the Spark Documentation, I read:
(...) For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.
However, I don't really understand the practical differences from reading this, and I don't get what the advantages and disadvantages of the different deploy modes are.
Additionally, when I start my application using spark-submit, even if I set the property spark.submit.deployMode to "cluster", the Spark UI for my context still shows it running in client mode.
So I am not able to test both modes to see the practical differences. That being said, my questions are:
1) What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?
2) How do I choose which one my application is going to run on, using spark-submit?
What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of using each one?
Let's try to look at the differences between client and cluster mode.
Client:
Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).
Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources for the Driver program.
If the driver process dies, you need an external monitoring system to restart it.
Cluster:
Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master.
Driver runs as a dedicated, standalone process inside the Worker.
The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
Driver program can be monitored from the Master node using the --supervise flag and be reset in case it dies.
When working in cluster mode, all JARs related to the execution of your application need to be available to all of the workers. This means you can either place them manually on shared storage or copy them into a folder on each of the workers.
Which one is better? Not sure; that's actually for you to experiment with and decide. There is no obviously better choice here: you gain some things from each mode, and it's up to you to see which one works better for your use case.
How do I choose which one my application is going to run on, using spark-submit?
The way to choose which mode to run in is by using the --deploy-mode flag. From the Spark Configuration page:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
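For example, a filled-in invocation for a standalone cluster like the one in the question might look like this (the host name, class, and jar path are placeholders; --supervise is the restart flag mentioned above and only applies in cluster mode):
./bin/spark-submit \
--class com.example.MyApp \
--master spark://master-host:7077 \
--deploy-mode cluster \
--supervise \
/path/to/myapp.jar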
Let's say you are going to run spark-submit on EMR by SSHing to the master node.
If you provide the option --deploy-mode cluster, then the following things will happen:
You won't be able to see the detailed logs in the terminal.
Since the driver is not running in your spark-submit process on the master, you won't be able to terminate the job from that terminal.
But in case of --deploy-mode client:
You will be able to see the detailed logs in the terminal.
You will be able to terminate the job from the terminal itself.
These are the basic things that I have noticed till now.
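If you do need to stop a cluster-mode job, you can still do it from the command line, just not by killing the submitting terminal. On a YARN-backed cluster such as EMR, that looks roughly like the following (the application ID is a placeholder you can read off yarn application -list or the ResourceManager UI):
yarn application -kill application_1234567890123_0001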
I also have the same scenario: the master node is part of a standalone EC2 cluster. In this setup, client mode is appropriate. The driver is launched directly within the spark-submit process, which acts as a client to the cluster, and the input and output of the application are attached to the console. This mode is thus especially suitable for applications that involve a REPL.
Otherwise, if your application is submitted from a machine far from the worker machines, it is quite common to use cluster mode to minimize the network latency between the driver and the executors.

Spark yarn cluster vs client - how to choose which one to use?

The spark docs have the following paragraph that describes the difference between yarn client and yarn cluster:
There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
I'm assuming there are two choices for a reason. If so, how do you choose which one to use?
Please use facts to justify your response so that this question and answer(s) meet stackoverflow's requirements.
There are a few similar questions on stackoverflow, however those questions focus on the difference between the two approaches, but don't focus on when one approach is more suitable than the other.
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters. Currently only YARN supports cluster mode for Python applications. -- Submitting Applications
What I understand from this is that both strategies use the cluster to distribute tasks; the difference is where the "driver program" runs: locally with spark-submit, or, also in the cluster.
When you should use either of them is detailed in the quote above, but I also did another thing: for big jars, I used rsync to copy them to the cluster (or even to the master node) at roughly 100 times the speed of my own network connection, and then submitted from the cluster. This can be better than "cluster mode" for big jars. Note that client mode probably does not transfer the jar to the master; at that point the difference between the two is minimal. Client mode is probably better when the driver program is idle most of the time, to make full use of the cores on the local machine and perhaps avoid transferring the jar to the master (even on the loopback interface a big jar takes quite a few seconds). And with client mode you can transfer (rsync) the jar to any cluster node.
On the other hand, if the driver is very intensive in CPU or I/O, cluster mode may be more appropriate to better balance the cluster: in client mode, the local machine would run both the driver and as many workers as possible, overloading it so that local tasks run slower and the whole job may end up waiting on a couple of tasks from the local machine.
Conclusion:
To sum up, if I am on the same local network as the cluster, I would use client mode and submit from my laptop. If the cluster is far away, I would either submit locally with cluster mode, or rsync the jar to the remote cluster and submit it there, in client or cluster mode, depending on how heavy the driver program is on resources.
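A rough sketch of the rsync-then-submit approach described above (the host, user, paths, and class name are all placeholders):
# copy the fat jar to a gateway/master node inside the cluster's network
rsync -avz target/myapp.jar user@cluster-gateway:/home/user/jobs/
# then submit from that node, in client or cluster mode as appropriate
ssh user@cluster-gateway "spark-submit --master yarn --deploy-mode client --class com.example.MyApp /home/user/jobs/myapp.jar"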
AFAIK, with the driver program running in the cluster, it is less vulnerable to remote disconnects crashing the driver and the entire Spark job. This is especially useful for long-running jobs such as stream-processing workloads.
Spark Jobs Running on YARN
When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach enables several orders of magnitude faster task startup time.
Spark supports two modes for running on YARN, “yarn-cluster” mode and “yarn-client” mode. Broadly, yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense for interactive and debugging uses where you want to see your application’s output immediately.
Understanding the difference requires an understanding of YARN's Application Master concept. In YARN, each application instance has an Application Master process, which is the first container started for that application. The Application Master is responsible for requesting resources from the ResourceManager and, when they are allocated, telling NodeManagers to start containers on its behalf. Application Masters obviate the need for an active client: the process starting the application can go away, and coordination continues from a process managed by YARN running on the cluster.
In yarn-cluster mode, the driver runs in the Application Master. This means that the same process is responsible for both driving the application and requesting resources from YARN, and this process runs inside a YARN container. The client that starts the app doesn’t need to stick around for its entire lifetime.
yarn-cluster mode
The yarn-cluster mode is not well suited to using Spark interactively, but the yarn-client mode is. Spark applications that require user input, like spark-shell and PySpark, need the Spark driver to run inside the client process that initiates the Spark application. In yarn-client mode, the Application Master is merely present to request executor containers from YARN. The client communicates with those containers to schedule work after they start:
yarn-client mode
A table in the referenced blog post offers a concise list of the differences between these modes.
Reference: https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ - Apache Spark Resource Management and YARN App Models (web.archive.com mirror)
In yarn-cluster mode, the driver program runs on the node where the Application Master is running, whereas in yarn-client mode the driver program runs on the node from which the job is submitted, i.e. the centralized gateway node.
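For concreteness, the two modes correspond to invocations like the following (class and jar names are placeholders; very old Spark versions spelled the master as yarn-cluster/yarn-client instead of using --deploy-mode):
# driver runs on the cluster, inside the YARN ApplicationMaster
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar
# driver runs in the local spark-submit process (what spark-shell/PySpark require)
spark-submit --master yarn --deploy-mode client --class com.example.MyApp myapp.jar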
Both answers by Ram and Chavati/Wai Lee are useful and helpful, but here I just want to add a couple of other points:
Much has been written about the proximity of the driver to the executors, and that is a big factor. The other factors are:
Will the driver process be around until execution of job is finished?
How's returned data being handled?
#1. In Spark client mode, the driver process must be up and running the whole time the Spark job is executing. So if you have a truly long job that, say, takes many hours to run, you need to make sure the driver process is still up and running, and that the driver session is not logged out automatically.
On the other hand, after submitting a job to run in cluster mode, the submitting process can go away and the job will keep running on the cluster. So this is typically how a production job runs: the job can be triggered by a timer or by an external event, and it will then run to completion without any concern for the lifetime of the process that submitted it.
#2. In client mode, you can call collect() on an RDD or DataFrame to gather all the data back from the executors, and then write the returned data to a local file on the driver's local disk. This may not work in cluster mode, as the driver typically runs on a different, remote host. The data written out therefore needs to be persisted on a commonly mounted volume (such as GPFS or NFS) or in a distributed file system like HDFS. If the job is running under Hadoop/YARN, the more common approach for cluster mode is simply to have each executor persist its data to HDFS and not call collect() at all. collect() actually has scalability issues when a large number of executors return a large amount of data: it can overwhelm the driver process.
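A small Spark-shell sketch of the two patterns above (the paths and the toy RDD are placeholders; sc is the shell's SparkContext):
// a toy RDD standing in for real data
val rdd = sc.parallelize(1 to 1000).map(i => s"record-$i")
// client-mode pattern: collect to the driver, then write a local file
val results = rdd.collect() // can overwhelm the driver for large outputs
val pw = new java.io.PrintWriter("/tmp/results.txt")
results.foreach(r => pw.println(r))
pw.close()
// cluster-friendly pattern: have each executor write its partition straight to HDFS
rdd.saveAsTextFile("hdfs:///user/me/results")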

How to run Spark Sql on a 10 Node cluster

I am using Spark for the first time. I have set up Spark on Hadoop 2.7 on a cluster with 10 nodes. On my master node, the following processes are running:
hduser#hadoop-master-mp:~$ jps
20102 ResourceManager
19736 DataNode
20264 NodeManager
24762 Master
19551 NameNode
24911 Worker
25423 Jps
Now, I want to write Spark SQL to do a certain computation on a 1 GB file, which is already present in HDFS.
If I go into spark shell on my master node:
spark-shell
and write the following query, will it just run on my master, or will it use all 10 nodes as workers?
scala> sqlContext.sql("CREATE TABLE sample_07 (code string,description string,total_emp int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
If not, what do I have to do to make my Spark SQL use the full cluster?
You need a cluster manager to manage the master and workers. You can go for either the Spark standalone, YARN, or Mesos cluster manager. I would suggest the Spark standalone cluster manager rather than YARN, just to get started.
To just start it up,
Download the Spark distribution (pre-compiled for Hadoop) on all the nodes and set the Hadoop classpath and other important configuration in spark-env.sh.
1) Start the master using ./sbin/start-master.sh.
This brings up a web interface (default port 8080). Open the Spark master web page and note the Spark master URI shown on the page.
2) Go to all the nodes, including the machine where you started the master, and start a worker:
./sbin/start-slave.sh <spark-master-uri>
Check the master web page again. It should list all the workers on the page. If it doesn't, you need to find the error in the logs.
3) Please check the cores and memory that each machine has against what is shown on the master web page for each worker. If they don't match, you can pass options to the start scripts to control what gets allocated.
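Putting the steps above together, a minimal sketch (the hostname is a placeholder; it assumes SPARK_HOME points at the install on every node):
# step 1, on the master node
$SPARK_HOME/sbin/start-master.sh # web UI on :8080 shows the spark://<master-host>:7077 URI
# step 2, on every node (master included)
$SPARK_HOME/sbin/start-slave.sh spark://<master-host>:7077
# finally, point the shell at the cluster instead of running locally
spark-shell --master spark://<master-host>:7077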
Go for Spark 1.5.2 or later.
Please follow the details in the Spark standalone documentation.
As this is just a starting point, let me know if you face any errors and I can help you out.

Resources