How does YARN know about data locality in Apache Spark in cluster mode?

Assume that there is a Spark job that reads a file named records.txt from HDFS, applies some transformations, and runs one action (writing the processed output back to HDFS). The job is submitted in YARN cluster mode.
Assume also that records.txt is a 128 MB file and that one of its replicated HDFS blocks is on NODE 1.
Let's say YARN allocates an executor on NODE 1.
How does YARN allocate an executor on exactly the node where the input data is located?
Who tells YARN that one of the replicated HDFS blocks of records.txt is available on NODE 1?
How does the Spark application determine data locality? Is it done by the driver, which runs inside the Application Master?
Does YARN know about data locality?

The fundamental question here is:
Does YARN know about data locality?
YARN "knows" what the application tells it, and it understands the structure (topology) of the cluster. When an application makes a resource request, it can include specific locality constraints, which may or may not be satisfied when resources are allocated.
If the constraints cannot be satisfied, YARN (or any other cluster manager) will attempt to provide the best alternative match, based on its knowledge of the cluster topology.
So how does the application "know"?
If the application uses an input source (file system or other) that supports some form of data locality, it can query the corresponding catalog (the NameNode in the case of HDFS) to get the locations of the blocks of data it wants to access.
In a broader sense, a Spark RDD can define preferredLocations, depending on the specific RDD implementation, which can later be translated into resource constraints for the cluster manager (not necessarily YARN).
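The flow described above can be sketched as a toy model in plain Python. This is not the YARN or Spark API: the topology, the block catalog, and the names `build_request` and `allocate` are all hypothetical and exist only to illustrate that the application supplies the locality hint while the scheduler falls back to the next best match using its knowledge of the topology.

```python
from dataclasses import dataclass

# Hypothetical toy topology: host -> rack. A real cluster manager gets this
# from its own topology configuration.
TOPOLOGY = {
    "node1": "rackA",
    "node2": "rackA",
    "node3": "rackB",
    "node4": "rackC",
}

# Hypothetical catalog standing in for the NameNode: file -> hosts that hold
# a replica of its block.
BLOCK_LOCATIONS = {"records.txt": ["node1", "node3"]}


@dataclass
class ResourceRequest:
    preferred_hosts: list  # locality hint supplied by the application


def build_request(path):
    """Application side: look up block locations and turn them into a hint."""
    return ResourceRequest(preferred_hosts=BLOCK_LOCATIONS.get(path, []))


def allocate(request, free_hosts):
    """Scheduler side: prefer node-local, then rack-local, then any free host."""
    for host in request.preferred_hosts:
        if host in free_hosts:
            return host, "NODE_LOCAL"
    preferred_racks = {TOPOLOGY[h] for h in request.preferred_hosts}
    for host in free_hosts:
        if TOPOLOGY[host] in preferred_racks:
            return host, "RACK_LOCAL"
    if free_hosts:
        return free_hosts[0], "ANY"
    return None, "NONE"


request = build_request("records.txt")
print(allocate(request, ["node1", "node2"]))  # a preferred host is free
print(allocate(request, ["node2"]))           # only a rack neighbour is free
print(allocate(request, ["node4"]))           # nothing close is free
```

In this sketch the scheduler never inspects the file itself. It only sees the list of preferred hosts that the application attached to its request.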

Related

Run spark cluster using an independent YARN (without using Hadoop's YARN)

I want to deploy a Spark cluster with the YARN cluster manager.
This Spark cluster needs to read data from an external HDFS filesystem belonging to an existing Hadoop ecosystem that also has its own YARN. (However, I am not allowed to use the Hadoop cluster's YARN.)
My questions are:
Is it possible to run a Spark cluster using an independent YARN, while still reading data from an outside HDFS filesystem?
If yes, is there any downside or performance penalty to this approach?
If no, can I run Spark as a standalone cluster, and will there be any performance issue?
Assume both the Spark cluster and the Hadoop cluster are running in the same data center.
using an independent YARN, while still reading data from an outside HDFS filesystem
Yes. Point yarn-site.xml at the YARN cluster you want to use, and refer to external file locations by fully qualified URI, such as hdfs://namenode-external:8020/file/path
any downside or performance penalty to this approach
Yes. All reads will be remote rather than cluster-local. The performance degradation would be similar to that of reading from S3 or other remote locations.
can I run Spark as a standalone cluster
You could, or you could use Kubernetes if that's available, but both are pointless IMO if there's already a YARN cluster (with enough resources) available.

HDFS vs HDFS with YARN, and if I use spark can I put new resource management?

What is the use case for HDFS alone, without YARN, and for HDFS with it? Should I use MapReduce, or can I use only Spark? Also, if I use Spark, can I use a different resource manager for Spark instead of YARN on the same system? Is that the optimal solution, and how do I decide between them based on the use case?
Sorry, I don't have a specific use case!
Hadoop Distributed File System
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms.
HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.
Take Away
HDFS is based on a master/slave architecture, with the Name Node (NN) as the master and the Data Nodes (DN) as the slaves.
The Name Node stores only the meta information about the files; the actual data is stored on the Data Nodes.
Both the Name Node and the Data Node are processes, not any super fancy hardware.
The Data Node uses the underlying OS file system to save the data.
You need to use an HDFS client to interact with HDFS. The HDFS client always talks to the Name Node for meta info and subsequently talks to the Data Nodes to read/write data. No data IO happens through the Name Node.
HDFS clients never send data to the Name Node, hence the Name Node never becomes a bottleneck for any data IO in the cluster.
The HDFS client has a "short-circuit" read feature: when it is enabled and the client runs on a node hosting a Data Node that holds the block, it can read the file directly from that Data Node's local storage, making the read completely local.
To make it even simpler, imagine the HDFS client as a web client and HDFS as a whole as a web service with predefined operations such as GET, PUT, COPYFROMLOCAL, etc.
How a 400 MB file is saved on HDFS with an HDFS block size of 100 MB: it is split into four 100 MB blocks stored on the Data Nodes.
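The 400 MB example can be worked through with a small, self-contained Python sketch. The helper names `split_into_blocks` and `place_blocks` are hypothetical, and the round-robin placement is an assumption made for illustration only: real HDFS placement is replicated and rack-aware.

```python
def split_into_blocks(file_size_mb, block_size_mb):
    """Return the size of each block; the last block may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks


def place_blocks(blocks, data_nodes):
    """Assign block index -> Data Node, round-robin (illustrative assumption)."""
    return {i: data_nodes[i % len(data_nodes)] for i in range(len(blocks))}


blocks = split_into_blocks(400, 100)
print(blocks)                                      # four blocks of 100 MB
print(place_blocks(blocks, ["dn1", "dn2", "dn3"]))  # block index -> Data Node
```

The Name Node would only record a mapping like the one returned by `place_blocks`; the block contents themselves live on the Data Nodes.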
YARN (Yet Another Resource Negotiator)
"does it ring a bell 'Yet Another Hierarchically Organized Oracle' YAHOO"
YARN is essentially a system for managing distributed applications. It consists of a central Resource manager (RM), which arbitrates all available cluster resources, and a per-node Node Manager (NM), which takes direction from the Resource manager. The Node manager is responsible for managing available resources on a single node.
Take Away
YARN is based on a master/slave architecture, with the Resource Manager as the master and the Node Managers as the slaves.
The Resource Manager keeps the meta info about which jobs are running on which Node Manager and how much memory and CPU they consume, and hence has a holistic view of the total CPU and RAM consumption of the whole cluster.
Jobs run on the Node Managers and never execute on the Resource Manager, hence the RM never becomes a bottleneck for job execution. Both the RM and the NM are processes, not some fancy hardware.
A container is a logical abstraction for CPU and RAM.
YARN (Yet Another Resource Negotiator) schedules containers (CPU and RAM) across the whole cluster. Hence an end user who needs CPU and RAM in the cluster has to interact with YARN.
While requesting CPU and RAM, you can specify the host on which you need them.
To interact with YARN you need to use yarn-client which
How HDFS and YARN work in TANDEM
The Name Node and Resource Manager processes are hosted on two different hosts, as they hold key meta information.
The Data Node and Node Manager processes are co-located on the same host.
A file is saved onto HDFS (Data Nodes). To access the file in a distributed way, one can write a YARN application (MR2, Spark, Distributed Shell, Slider application) using the YARN client, and use the HDFS client to read the data.
The distributed application can fetch the file locations (meta info from the Name Node) and ask the Resource Manager (YARN) to provide containers on the hosts that hold the file blocks.
Remember the short-circuit optimization provided by HDFS: if the distributed job gets a container on a host that holds the file block and tries to read it, the read will be local and not over the network.
The same 400 MB file, which would take 4 seconds to read sequentially (at 100 MB/sec), can be read in 1 second, because the distributed process runs in parallel in four YARN containers (on different Node Managers), each reading its own 100 MB block at 100 MB/sec.
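The arithmetic behind the 4 second versus 1 second comparison can be written out as a short Python sketch. The function `read_time_seconds` is a hypothetical helper, and the model is deliberately simplified: it assumes every block is full size, every reader sustains the same throughput, and there is no scheduling or network overhead.

```python
import math


def read_time_seconds(file_size_mb, block_size_mb, throughput_mb_per_s, parallel_readers):
    """Estimate read time when blocks are read in waves of parallel readers."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Each wave reads up to `parallel_readers` blocks at the same time.
    waves = math.ceil(num_blocks / parallel_readers)
    return waves * (block_size_mb / throughput_mb_per_s)


print(read_time_seconds(400, 100, 100, parallel_readers=1))  # sequential read
print(read_time_seconds(400, 100, 100, parallel_readers=4))  # one container per block
```

With only two containers the same model gives two waves, so the speed-up depends on how many containers the application actually obtains.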
HDFS can be used as a fault-tolerant filesystem without YARN, yes. But so can Ceph, MinIO, GlusterFS, etc., each of which can work with Spark.
That addresses storage. For processing, you can configure only one scheduler per Spark job, but the same code should be able to run in any environment, whether that is YARN, Spark Standalone, Mesos, or Kubernetes. Ideally, you would not install these together.
Therefore, neither HDFS nor YARN is required for Spark.

Spark Standalone vs YARN

What features of YARN make it better than Spark Standalone mode for a multi-tenant cluster running only Spark applications? Maybe besides authentication.
There are a lot of answers on Google, but most of them sound wrong to me, so I'm not sure where the truth is.
For example:
DZone, Deep Dive Into Spark Cluster Management
Standalone is good for small Spark clusters, but it is not good for
bigger clusters (there is an overhead of running Spark daemons —
master + slave — in cluster nodes)
But other cluster managers also require running agents on cluster nodes. For example, YARN's slaves are called node managers, and they may consume even more memory than Spark's slaves (the Spark default is 1 GB).
This answer
The Spark standalone mode requires each application to run an executor
on every node in the cluster; whereas with YARN, you choose the number
of executors to use
Against: Spark Standalone # executor/cores control, which shows how you can specify the amount of consumed resources in Standalone mode.
Spark Standalone Mode documentation
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications.
This seems to conflict with the fact that Standalone mode can use Dynamic Allocation, where you can specify spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors. Also, I haven't found any note saying that Standalone doesn't support FairScheduler.
This answer
YARN directly handles rack and machine locality
How can YARN know anything about data locality in my job? Suppose I'm storing file locations in AWS Glue (used by EMR as the Hive metastore), and inside the Spark job I'm querying some-db.some-table. How can YARN know which executor is better for job assignment?
UPD: found another mention of YARN and data locality: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. It still doesn't matter in the case of S3, for example.

How does spark dynamic resource allocation work on YARN (with regards to NodeManagers)?

Let's assume that I have 4 NMs and have configured Spark in yarn-client mode. Then I set dynamic allocation to true to automatically add or remove executors based on workload. If I understand correctly, each Spark executor runs as a YARN container.
So, if I add more NMs, will the number of executors increase?
If I remove an NM while a Spark application is running, will something happen to that application?
Can I add/remove executors based on other metrics? If the answer is yes, is there a function, preferably in Python, that does that?
If I understand correctly, each Spark executor runs as a Yarn container.
Yes. That's how it happens for any application deployed to YARN, Spark included. Spark is not in any way special to YARN.
So, if I add more NM will the number of executors increase ?
No. There's no relationship between the number of YARN NodeManagers and Spark's executors.
From Dynamic Resource Allocation:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand.
As you may have guessed by now, it is irrelevant how many NMs you have in your cluster; it is the workload that determines whether Spark requests new executors or removes some.
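To make the point concrete, here is a simplified Python model of a workload-driven executor target. It is a sketch under stated assumptions, not Spark's actual implementation: the function `target_executors` and its parameters are hypothetical, and the real policy also involves timeouts and a gradual ramp-up. The `min_executors` and `max_executors` bounds play the role of spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors.

```python
import math


def target_executors(pending_tasks, running_tasks, executor_cores,
                     task_cpus=1, min_executors=0, max_executors=10):
    """Estimate how many executors the current workload calls for."""
    # How many tasks a single executor can run at the same time.
    tasks_per_executor = max(executor_cores // task_cpus, 1)
    needed = math.ceil((pending_tasks + running_tasks) / tasks_per_executor)
    # Clamp to the configured bounds.
    return max(min_executors, min(needed, max_executors))


print(target_executors(pending_tasks=40, running_tasks=0, executor_cores=4))
print(target_executors(pending_tasks=0, running_tasks=0, executor_cores=4, min_executors=2))
```

Note that the number of NodeManagers does not appear anywhere in the calculation. Extra NMs only matter indirectly, by giving the cluster manager more capacity to satisfy the requests that the workload generates.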
If I remove a NM while a Spark application is running, something will happen to that application?
Yes, but only when Spark uses that NM for executors. After all, NodeManager gives resources (CPU and memory) to a YARN cluster manager that will in turn give them to applications like Spark applications. If you take them back, say by shutting the node down, the resource won't be available anymore and the process of a Spark executor simply dies (as any other process with no resources to run).
Can I add/remove executors based on other metrics ?
Yes, but usually it's Spark's job (no pun intended) to do the calculation and request new executors.
You can use SparkContext to manage executors using killExecutors, requestExecutors and requestTotalExecutors methods.
killExecutor(executorId: String): Boolean Request that the cluster manager kill the specified executor.
requestExecutors(numAdditionalExecutors: Int): Boolean Request an additional number of executors from the cluster manager.
requestTotalExecutors(numExecutors: Int, localityAwareTasks: Int, hostToLocalTaskCount: Map[String, Int]): Boolean Update the cluster manager on our scheduling needs.

Spark: hdfs cluster mode

I'm just getting started with Apache Spark. I'm using cluster mode (master, slave1, slave2) and I want to process a big file which is kept in Hadoop (HDFS). I am using the textFile method from SparkContext; while the file is being processed I monitor the nodes and I can see that only slave2 is working. After processing, slave2 has tasks but slave1 has none.
If I use a local file instead of HDFS, both slaves work simultaneously.
I don't understand this behaviour. Can anybody give me a clue?
The main reason for that behavior is the concept of data locality. When Spark's Application Master asks for the creation of new executors, it tries to have them allocated on the same node where the data resides.
In your case, HDFS has likely written all the blocks of the file to the same node, so Spark instantiates the executors on that node. If you use a local file instead, it is present on all nodes, so data locality is no longer an issue.
