When does spark master show that memory is in use and cores are in use? - apache-spark

I am running Spark on YARN. In the YARN UI at port 8088 I can see that there is a job clearly running, and when I click on ApplicationMaster on the right-hand side, it shows that the job is progressing in Spark.
When I go to port 18080 for the Spark master, however, I see that "Memory in use" is 0, "Cores in use" is 0, and "Applications: 0 Running".
How do I get the Spark master to acknowledge that I am running an application and using cores and memory? The job is obviously progressing because I can see things being written to disk, so why is the Spark master not up to date on it?

Spark Master is a component of Spark Standalone. YARN and Spark Standalone are both cluster managers, and you should use only one of them. If you submit an application to YARN, you won't see it in the Spark Master UI; in that case the cluster resources are managed by YARN, not by the Spark Master.
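For example, the cluster manager that spark-submit targets determines which UI tracks the application. A minimal sketch, assuming a placeholder class and jar:

# Submitted to YARN: tracked by the YARN ResourceManager UI (port 8088), not by the Spark Master UI
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
# Submitted to Spark Standalone: tracked by the Spark Master UI (port 8080 by default)
spark-submit --master spark://master-host:7077 --class com.example.MyApp my-app.jar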

Related

How does Spark prepare executors on Hadoop YARN?

I'm trying to understand the details of how Spark prepares the executors. In order to do this I tried to debug org.apache.spark.executor.CoarseGrainedExecutorBackend and invoked
Thread.currentThread().getContextClassLoader.getResource("")
It points to the following directory:
/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/application_1507907717252_15771/container_1507907717252_15771_01_000002/
Looking at the directory I found the following files:
default_container_executor_session.sh
default_container_executor.sh
launch_container.sh
__spark_conf__
__spark_libs__
The question is: who delivers these files to each executor and then runs CoarseGrainedExecutorBackend with the appropriate classpath? What are the scripts? Are they all auto-generated by YARN?
I looked at org.apache.spark.deploy.SparkSubmit, but didn't find anything useful inside.
Ouch... you're asking for quite a lot of detail on how Spark communicates with cluster managers while requesting resources. Let me give you some information, and keep asking if you want more...
You are using Hadoop YARN as the cluster manager for Spark applications. Let's focus on this particular cluster manager only (as there are others that Spark supports like Apache Mesos, Spark Standalone, DC/OS and soon Kubernetes that have their own ways to deal with Spark deployments).
By default, when you submit a Spark application using spark-submit, the application (more precisely, the SparkContext it creates) requests three YARN containers: one container is for the Spark application's ApplicationMaster, which knows how to talk to YARN and requests two more YARN containers for two Spark executors.
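For instance, those defaults can be seen (and overridden) on the spark-submit command line; the class and jar below are placeholders:

# 2 executors is the default on YARN (unless dynamic allocation is enabled)
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 2 --executor-memory 1g --executor-cores 1 \
  --class com.example.MyApp my-app.jar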
You could review the YARN official documentation's Apache Hadoop YARN and Hadoop: Writing YARN Applications to dig deeper into the YARN internals.
While submitting the Spark application, Spark's ApplicationMaster is submitted to YARN using the YARN "protocol", which requires that the request for the very first YARN container (container 0) use a ContainerLaunchContext that holds all the necessary launch details (see Client.createContainerLaunchContext).
who delivers the files to each executor
That's how YARN gets told how to launch the ApplicationMaster for the Spark application. While fulfilling the request for an ApplicationMaster container, YARN downloads the necessary files, which you found in the container's working space.
That's very internal to how any YARN application works on YARN and has (almost) nothing to do with Spark.
The code that's responsible for the communication is in Spark's Client, esp. Client.submitApplication.
and then just runs CoarseGrainedExecutorBackend with the appropriate classpath.
Quoting Mastering Apache Spark 2 gitbook:
CoarseGrainedExecutorBackend is a standalone application that is started in a resource container when (...) Spark on YARN’s ExecutorRunnable is started.
ExecutorRunnable is started when Spark on YARN's YarnAllocator schedules it in allocated YARN resource containers.
What are the scripts? Are they all YARN-autogenerated?
Kind of.
Some are prepared by Spark as part of a Spark application submission while others are YARN-specific.
Enable DEBUG logging level in your Spark application and you'll see the file transfer.
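As a rough sketch, assuming the log4j.properties template shipped with Spark 2.x, raising the log level of the YARN client shows the resources (e.g. __spark_conf__ and __spark_libs__) being uploaded:

# conf/log4j.properties
log4j.rootCategory=INFO, console
# Log the YARN client that prepares and uploads the application resources
log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG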
You can find more information in the official Spark documentation's Running Spark on YARN and in my Mastering Apache Spark 2 gitbook.

Why can't I see my spark job on port 4040 even though I see it running in the yarn UI?

It was my understanding that we can always see spark jobs at port 4040 on the master node.
I am running Spark on YARN, and I'm submitting a job through spark-submit. When I go into the YARN UI, I can see that the job is RUNNING, and when I go to the Spark master at <master ip>:18081 I can see that the Spark job has been picked up in the history server.
(And of course I do not see it running at <master ip>:18080 because I'm not running in standalone.)
However, I am not able to see the job running at port 4040 in the yarn UI.
The job does not fail - it just hangs forever. And I can see that my workers are alive in the Spark master UI.
This is the resource utilization shown by yarn while it is running:
How do I fix this to show that my spark job is running, or at least give me some feedback on whether or not my job is progressing?

Role of master in Spark standalone cluster

In a Spark standalone cluster, what exactly is the role of the master (the node started with the start_master.sh script)?
I understand that it is the node that receives jobs from the submit-job.sh script, but what is its role when processing a job?
I see in the web UI that it always delivers the job to a slave (a node started with start_slave.sh) and does not participate in the processing itself. Am I right? In that case, should I also run the start_slave.sh script on the same machine as the master to take advantage of its resources (CPU and memory)?
Thanks in advance.
Spark runs in the following cluster modes:
Local
Standalone
Mesos
Yarn
The above are cluster modes, which offer resources to Spark applications.
Spark standalone mode is a master-slave architecture: we have a Spark Master and Spark Workers. The Spark Master runs on one of the cluster nodes and the Spark Workers run on the slave nodes of the cluster.
The Spark Master (often written "standalone Master") is the resource manager for the Spark Standalone cluster; it allocates resources (CPU, memory, disk, etc.) among the Spark applications, and those resources are used to run the Spark driver and executors. Spark Workers report the resource information of the slave nodes to the Spark Master.
Spark standalone comes with its own resource manager. Think of the Spark Master/Worker as analogous to the YARN ResourceManager/NodeManager.
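To illustrate the architecture (and the question of whether the master machine can also do work): a minimal sketch, with a placeholder hostname, of a standalone setup where a Worker is started on the same host as the Master so that host's CPU and memory can be used for executors:

# On the master machine
$SPARK_HOME/sbin/start-master.sh
# Also on the master machine (or on any slave): register a Worker with the Master
$SPARK_HOME/sbin/start-slave.sh spark://master-host:7077
# Submit an application to the standalone Master
spark-submit --master spark://master-host:7077 --class com.example.MyApp my-app.jar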

Setting Driver manually in Spark Submit over Yarn Cluster

I noticed that when I start a job with spark-submit on YARN, the driver and executor nodes get chosen randomly. Is it possible to set this manually, so that when I collect the data and write it to file, it is written on the same node every single time?
As of right now, the parameters I have tried playing around with are:
spark.yarn.am.port <driver-ip-address>
and
spark.driver.hostname <driver-ip-address>
Thanks!
If you submit to Yarn with --master yarn --deploy-mode client, the driver will be located on the node you are submitting from.
You can also configure node labels for executors using the property spark.yarn.executor.nodeLabelExpression:
A YARN node label expression that restricts the set of nodes executors will be scheduled on. Only versions of YARN greater than or equal to 2.6 support node label expressions, so when running against earlier versions, this property will be ignored.
Docs - Running Spark on YARN - Latest Documentation
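A sketch of how that property could be passed at submission time; the label gpu_nodes is an assumption and must match a node label you have actually configured in YARN:

spark-submit --master yarn --deploy-mode client \
  --conf spark.yarn.executor.nodeLabelExpression=gpu_nodes \
  --class com.example.MyApp my-app.jar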
A Spark application on YARN can run in either yarn-cluster or yarn-client mode.
In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client machine can go away after initiating the application.
In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
So, as you can see, where the driver and ApplicationMaster run depends on the deploy mode; nothing happens randomly up to this stage. However, the worker nodes on which the ApplicationMaster asks the ResourceManager to run tasks are picked based on the availability of the worker nodes.
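For example, the deploy mode is chosen at submission time (class and jar names are placeholders):

# Driver runs inside the ApplicationMaster container on the cluster
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
# Driver runs in the client process on the machine you submit from
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar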

Apache Spark and Mesos running on a single node

I am interested in testing Spark running on Mesos. I created a Hadoop 2.6.0 single-node cluster in VirtualBox and installed Spark on it. I can successfully process files in HDFS using Spark.
Then I installed Mesos Master and Slave on the same node. I tried to run Spark as a framework in Mesos using these instructions. I get the following error with Spark:
WARN TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient resources
spark-shell is successfully registered as a framework in Mesos. Is there anything wrong with using a single-node setup, or do I need to add more Spark worker nodes?
I am very new to Spark and my aim is to just test Spark, HDFS, and Mesos.
If you have allocated enough resources for the Spark slaves, the cause might be a firewall blocking the communication. Take a look at my other answer:
Apache Spark on Mesos: Initial job has not accepted any resources
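As a rough sketch for a single-node test, you can also cap what the shell asks for so the lone Mesos agent can satisfy the request; the master address and sizes below are assumptions:

spark-shell --master mesos://127.0.0.1:5050 \
  --conf spark.executor.memory=512m \
  --conf spark.cores.max=1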
