Spark program difference in local mode and cluster - apache-spark

If i write a spark program and run it in stand alone mode and when I want to deploy it in a cluster, do I have to change my program codes or no change in codes needed? Is spark programming independent of number of clusters?

I don't think you need to make any changes. Your program should run the same way as it run in local mode.
Yes, Spark programs are independent of clusters, until and unless you are using something specific to cluster. Normally this is managed by the YARN.

You just need to set option master to yarn or other resource manager, when you want to run it on cluster.
If you want to run it locally just use local[*] by using the number of threads which equal to your machine cores.

Related

how many JVMs are running when Spark Local mode is used?

newbie here, but really enjoyed Spark so far.
I did the following (using a laptop, running Windows 7):
start the master by using command prompt window:
spark-class org.apache.spark.deploy.master.Master
start one worker by typing the following:
spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077
repeat step 2, in other words, start another worker by using the same above command.
Now, I have one master, two workers, all in the same physical machine. Based on what I have been reading, this should be considered as the "local mode"... not sure about this, hope someone can confirm?
Also, from what I have read, local mode should have master and workers in one SINGLE JVM. But by running a small utility code, I can see that master is in one JVM, and two workers each stays in one JVM, so there are three JVMs, and they are different JVMs.
Can someone tell me which part I did wrong, or, what is the problem with my understanding?
Also, for this local model, there is no cluster manager, right?
Many thanks!
Local mode is a single JVM. Local mode is when you specify the master, via --master command line switch, as local[*]. This can be done via spark-submit or spark-shell.
This explains it pretty well.
Now, I have one master, two workers, all in the same physical machine. Based on what I have been reading, this should be considered as the "local mode"... not sure about this, hope someone can confirm?
It is not. It is a standalone mode, where you Spark's own cluster manager. Unlike local it is fully distributed. It will use:
For cluster manager:
Single JVM for each cluster master (one in your case, more in HA mode).
Single JVM for each started worker.
Optionally additional JVMs for services like history server, or shuffle service.
For application:
Single JVM for the driver.
Single JVM for each executor.
In local mode there is only one JVM, as already stated by Greg

How to set up Spark cluster on laptop?

I am completely new at Spark and try to run a tutorial example, which counts the number of lines containing 'a' and 'b' in a text file in the local file system.
I am running it with SparkContext with master = "local", i.e. Spark is running in the same JVM. Now I would like to try it in "cluster mode".
So I would like to run a Spark cluster of a cluster manager and two worker nodes locally on my Mac laptop. What is the easiest way to do that ?
Quoting the official documentation about Spark Standalone Mode:
./sbin/start-master.sh
./sbin/start-slave.sh <master-spark-URL>
In other words, you should start the standalone Master first (using ./sbin/start-master.sh) followed by starting one or more standalone Workers (using ./sbin/start-slave.sh).
Quoting the docs again:
Once you have started a worker, look at the master's web UI (http://localhost:8080 by default)
You're done. Congrats!
If you are looking to learn various ways to use SPARK I would suggest you to download the CLOUDERA quick start VM's which will give a simple cluster setup.
All you need to do is download the quick start VM and play around with the settings accordingly.
The quick start VM can be found here
Reference:Cloudera VM

Spark yarn cluster vs client - how to choose which one to use?

The spark docs have the following paragraph that describes the difference between yarn client and yarn cluster:
There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
I'm assuming there are two choices for a reason. If so, how do you choose which one to use?
Please use facts to justify your response so that this question and answer(s) meet stackoverflow's requirements.
There are a few similar questions on stackoverflow, however those questions focus on the difference between the two approaches, but don't focus on when one approach is more suitable than the other.
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters. Currently only YARN supports cluster mode for Python applications." -- Submitting Applications
What I understand from this is that both strategies use the cluster to distribute tasks; the difference is where the "driver program" runs: locally with spark-submit, or, also in the cluster.
When you should use either of them is detailed in the quote above, but I also did another thing: for big jars, I used rsync to copy them to the cluster (or even to master node) with 100 times the network speed, and then submitted from the cluster. This can be better than "cluster mode" for big jars. Note that client mode does not probably transfer the jar to the master. At that point the difference between the 2 is minimal. Probably client mode is better when the driver program is idle most of the time, to make full use of cores on the local machine and perhaps avoid transferring the jar to the master (even on loopback interface a big jar takes quite a bit of seconds). And with client mode you can transfer (rsync) the jar on any cluster node.
On the other hand, if the driver is very intensive, in cpu or I/O, cluster mode may be more appropriate, to better balance the cluster (in client mode, the local machine would run both the driver and as many workers as possible, making it over loaded and making it that local tasks will be slower, making it such that the whole job may end up waiting for a couple of tasks from the local machine).
Conclusion :
To sum up, if I am in the same local network with the cluster, I would
use the client mode and submit it from my laptop. If the cluster is
far away, I would either submit locally with cluster mode, or rsync
the jar to the remote cluster and submit it there, in client or
cluster mode, depending on how heavy the driver program is on
resources.*
AFAIK With the driver program running in the cluster, it is less vulnerable to remote disconnects crashing the driver and the entire spark job.This is especially useful for long running jobs such as stream processing type workloads.
Spark Jobs Running on YARN
When running Spark on YARN, each Spark executor runs as a YARN container. Where MapReduce schedules a container and fires up a JVM for each task, Spark hosts multiple tasks within the same container. This approach enables several orders of magnitude faster task startup time.
Spark supports two modes for running on YARN, “yarn-cluster” mode and “yarn-client” mode. Broadly, yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense for interactive and debugging uses where you want to see your application’s output immediately.
Understanding the difference requires an understanding of YARN’s Application Master concept. In YARN, each application instance has an Application Master process, which is the first container started for that application. The application is responsible for requesting resources from the ResourceManager, and, when allocated them, telling NodeManagers to start containers on its behalf. Application Masters obviate the need for an active client — the process starting the application can go away and coordination continues from a process managed by YARN running on the cluster.
In yarn-cluster mode, the driver runs in the Application Master. This means that the same process is responsible for both driving the application and requesting resources from YARN, and this process runs inside a YARN container. The client that starts the app doesn’t need to stick around for its entire lifetime.
yarn-cluster mode
The yarn-cluster mode is not well suited to using Spark interactively, but the yarn-client mode is. Spark applications that require user input, like spark-shell and PySpark, need the Spark driver to run inside the client process that initiates the Spark application. In yarn-client mode, the Application Master is merely present to request executor containers from YARN. The client communicates with those containers to schedule work after they start:
yarn-client mode
This table offers a concise list of differences between these modes:
Reference: https://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ - Apache Spark Resource Management and YARN App Models (web.archive.com mirror)
In yarn-cluster mode, the driver program will run on the node where application master is running where as in yarn-client mode the driver program will run on the node on which job is submitted on centralized gateway node.
Both answers by Ram and Chavati/Wai Lee are useful and helpful, but here I just want to added a couple of other points:
Much has been written about the proximity of driver to the executors, and that is a big factor. The other factors are:
Will the driver process be around until execution of job is finished?
How's returned data being handled?
#1. In spark client mode, the driver process must be up and running the whole time when the spark job is in execution. So if you have a truly long job that say take many hours to run, you need to make sure the driver process is still up and running, and that the driver session is not auto-logout.
On the other hand, after submitting a job to run in cluster mode, the process can go away. The cluster mode will keep running. So this is typically how a production job will run: the job can be triggered by a timer, or by an external event and then the job will run to its completion without worrying about the lifetime of the process submitting the spark job.
#2. In client mode, you can call sc.collect() to gather all the data back from all executors, and then write/save the returned data to a local Linux file on local disk. Now this may not work for cluster mode, as the 'driver' typically run in a different remote host. The data written up therefore need to be persisted in a common mounted volume (such as GPFS, NFS) or in distributed file system like HDFS. If the job is running under Hadoop/YARN, the more common way for cluster mode is simply ask each executor to persist the data to HDFS, and not to run collect( ) at all. Collect() actually have scalability issue when there are a large number of executors returning large amount of data - it can overwhelm the driver process.

Is spark or spark with mesos the easiest to start with?

If I want a simple setup that would give me a quick start: would a combination of apache-spark and mesos would be the easiest? or maybe apache-spark alone would be better because....i.e. mesos would add complexity to the process given what it does, or maybe mesos does way so many things that would be hard to deal with spark alone, etc...
All I want is to be able to submit jobs and manage the cluster and jobs easily, nothing fancy for now, is spark or spark/mesos better or something else...
The easiest way to start using Spark is starting stand alone spark cluster on EC2.
It is as easy as running single script - spark-ec2 and it will do the rest for you.
The only case when stand alone cluster may not suit you - if you want to run more then single spark job at a time (at least it was the case with Spark 1.1).
For me personally the stand alone Spark cluster was good enough for a long time when I was running ad-hoc jobs - analyzing company's logs on S3 and learning Spark, and then destroy the cluster.
If you want to run more than one Spark at a time - I would go with Mesos.
Alternative would be to install CDH from Cloudera which is relatively easy (they provide install scripts and install instructions) and it is available for free.
CDH would provide you powerful tools to manage the cluster.
Using CDH for running Spark - they use YARN, and we have one or another issue from time to time with running Spark on YARN.
The main disadvantage to me - CDHs provider its own build of Spark - so it usually one minor version behind, which is a lot for such rapid progressing project as Spark.
So I would try Mesos for running Spark if I need to run more then one job at a time.
Just for completeness, Hortonworks provides downloadable HDP sandbox VM as well as supports Spark on HDP. It is also a good starting point.
Additionally, you can spin off your own cluster. I do thisonmy laptop, not for real big data usecases but for learning with moderate amount of data.
import subprocess as s
from time import sleep
cmd = "D:\\spark\\spark-1.3.1-bin-hadoop2.6\\spark-1.3.1-bin-hadoop2.6\\spark-1.3.1-bin-hadoop2.6\\bin\\spark-class.cmd"
master = "org.apache.spark.deploy.master.Master"
worker = "org.apache.spark.deploy.worker.Worker"
masterUrl="spark://BigData:7077"
cmds={"masters":1,"workers":3}
masterProcess=[cmd,master]
workerProcess=[cmd,worker,masterUrl]
noWorker = 3
pMaster = s.Popen(masterProcess)
sleep(3)
pWorkers = []
for i in range(noWorker):
pw = s.Popen(workerProcess)
pWorkers.append(pw)
The code above starts master and 3 workers, which I can monitor using the UI. This is just to get going and if you need aquick local set up.

Setting up a (Linux) Hadoop cluster

Do you need to set up a Linux cluster first in order to setup a Hadoop cluster ?
No. Hadoop has its own software to manage a "cluster". Just install linux and make sure the machines can talk to each other.
Deploying the Hadoop software, along with the appropriate config files, and starting it on each node (which Hadoop can do automatically) creates the cluster from the Linux machines you have. So, no, by that definition you don't need to have a separate linux cluster. If your question is whether or not you need to have a multiple-machine cluster to use Hadoop: no, you can run Hadoop on a single machine for either testing or small-sized jobs, via either local mode (where everything is confined to a single process) or pseudodistributed mode (where you trick Hadoop into thinking it's running on multiple computers).

Resources