I am interested in testing Spark running on Mesos. I created a Hadoop 2.6.0 single-node cluster in my Virtualbox and installed Spark on it. I can successfully process files in HDFS using Spark.
Then I installed Mesos Master and Slave on the same node. I tried to run Spark as a framework in Mesos using these instructions. I get the following error with Spark:
WARN TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient resources
Sparkshell is successfully registered as a framework in the Mesos. Is there anything wrong with using a single-node setup? Or whether I need to add more Spark worker nodes?
I am very new to Spark and my aim is to just test Spark, HDFS, and Mesos.
If you have allocated enough resources for spark slaves, the cause might be firewall blocking the communication. Take a look at my other answer:
Apache Spark on Mesos: Initial job has not accepted any resources
Related
I'm trying to understand the details of how Spark prepares the executors. In order to do this I tried to debug org.apache.spark.executor.CoarseGrainedExecutorBackend and invoked
Thread.currentThread().getContextClassLoader.getResource("")
It points out to the following directory:
/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/application_1507907717252_15771/container_1507907717252_15771_01_000002/
Looking at the directory I found the following files:
default_container_executor_session.sh
default_container_executor.sh
launch_container.sh
__spark_conf__
__spark_libs__
The question is who delivers the files to each executor and then just runs CoarseGrainedExecutorBackend with the appropriate classpath? What are the scripts? Are they all YARN-autogenerated?
I looked at org.apache.spark.deploy.SparkSubmit, but didn't find anything useful inside.
Ouch...you're asking for quite a lot of details on how Spark communicates with cluster managers while requesting resources. Let me give you some information. Keep asking if you want more...
You are using Hadoop YARN as the cluster manager for Spark applications. Let's focus on this particular cluster manager only (as there are others that Spark supports like Apache Mesos, Spark Standalone, DC/OS and soon Kubernetes that have their own ways to deal with Spark deployments).
By default, while submitting a Spark application using spark-submit, the Spark application (i.e. the SparkContext it uses actually) requests three YARN containers. One container is for that Spark application's ApplicationMaster that knows how to talk to YARN and request two other YARN containers for two Spark executors.
You could review the YARN official documentation's Apache Hadoop YARN and Hadoop: Writing YARN Applications to dig deeper into the YARN internals.
While submitting the Spark application, Spark's ApplicationMaster is submitted to YARN using the YARN "protocol" that requires that the request for the very first YARN container (container 0) uses ContainerLaunchContext that holds all the necessary launch details (see Client.createContainerLaunchContext).
who delivers the files to each executor
That's how YARN gets told how to launch the ApplicationMaster for the Spark application. While fulfilling the request for a ApplicationMaster container, YARN downloads necessary files which you found in the container's working space.
That's very internal to how any YARN application works on YARN and has (almost) nothing to do with Spark.
The code that's responsible for the communication is in Spark's Client, esp. Client.submitApplication.
and then just runs CoarseGrainedExecutorBackend with the appropriate classpath.
Quoting Mastering Apache Spark 2 gitbook:
CoarseGrainedExecutorBackend is a standalone application that is started in a resource container when (...) Spark on YARN’s ExecutorRunnable is started.
ExecutorRunnable is started when when Spark on YARN's YarnAllocator schedules it in allocated YARN resource containers.
What are the scripts? Are they all YARN-autogenerated?
Kind of.
Some are prepared by Spark as part of a Spark application submission while others are YARN-specific.
Enable DEBUG logging level in your Spark application and you'll see the file transfer.
You can find more information in the Spark official documentation's Running Spark on YARN and the Mastering Apache Spark 2 gitbook of mine.
In a Spark standalone cluster, what is exactly the role of the master (a node started with start_master.sh script)?
I understand that is the node that receives the jobs from the submit-job.sh script, but what is its role when processing a job?
I'm seeing in the web UI that always delivers the job to a slave (a node started with start_slave.sh) and is not participating from processing, Am I right? In that case, should I also run also the script start_slave.sh in the same machine than master to to take advantage of its resources (cpu and memory)?
Thanks in advance.
Spark runs in the following cluster modes:
Local
Standalone
Mesos
Yarn
The above are cluster modes which offer resources to Spark Applications
Spark standalone mode is master slave architecture, we have Spark Master and Spark Workers. Spark Master runs in one of the cluster nodes and Spark Workers run on the Slave nodes of the cluster.
Spark Master (often written standalone Master) is the resource manager
for the Spark Standalone cluster to allocate the resources (CPU, Memory, Disk etc...) among the
Spark applications. The resources are used to run the Spark Driver and Executors.
Spark Workers report to Spark Master about resources information on the Slave nodes.
[apache-spark]
Spark standalone comes with its own resource manager. Think about Spark Master/Worker as YARN ResourceManager/NodeManager.
I am a beginner in Apache-spark!
I have setup Spark standalone cluster using 4 PCs.
I want to use Mesos with existing Spark standalone cluster. But I read that I need to install Mesos first then configure the spark.
I have also seen the Documentation of Spark on setting with Mesos, but it is not helpful for me.
So how to configure Mesos with existing spark standalone cluster?
Mesos is an alternative cluster manager to standalone Spark manger. You don't use it with, you use it instead of.
to create Mesos cluster follow https://mesos.apache.org/gettingstarted/
make sure to distribute Mesos native library is available on the machine you use to submit jobs
for cluster mode start Mesos dispatcher (sbin/start-mesos-dispatcher.sh).
submit application using Mesos master URI (client mode) or dispatcher URI (cluster mode).
I would like to run multiple spark jobs on my Mesos cluster, and have all spark jobs share the same spark framework. Is this possible?
I have tried running the MesosClusterDispatcher and have the spark jobs connect to the dispatcher, but each spark job launches its own "Spark Framework" (I have tried running both client-mode and cluster-mode).
Is this the expected behaviour?
Is it possible to share the same spark-framework among multiple spark jobs?
It is normal and it's the expected behaviour.
In Mesos as far as I know, SparkDispatcher is in charge of allocate resources for your Spark Driver which will act as a framework. Once Spark driver has been allocated, it is responsible for talk to Mesos and accept offers to allocate the executors where tasks will be executed.
I understood YARN and Spark. But I want to know when I need to use Yarn and Spark processing engine. What are the different case studies in that I can identify the difference between YARN and Spark?
You cannot compare Yarn and Spark directly per se. Yarn is a distributed container manager, like Mesos for example, whereas Spark is a data processing tool. Spark can run on Yarn, the same way Hadoop Map Reduce can run on Yarn. It just happens that Hadoop Map Reduce is a feature that ships with Yarn, when Spark is not.
If you mean comparing Map Reduce and Spark, I suggest reading this other answer.
Apache Spark can be run on YARN, MESOS or StandAlone Mode.
Spark in StandAlone mode - it means that all the resource management and job scheduling are taken care Spark inbuilt.
Spark in YARN - YARN is a resource manager introduced in MRV2, which not only supports native hadoop but also Spark, Kafka, Elastic Search and other custom applications.
Spark in Mesos - Spark also supports Mesos, this is one more type of resource manager.
Advantages of Spark on YARN
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN schedulers can be used for spark jobs, Only With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes.
Link for more documentation on YARN, Spark.
We can conclude saying this, if you want to build a small and simple cluster independent of everything go for standalone. If you want to use existing hadoop cluster go for YARN/Mesos.