HDFS vs HDFS with YARN, and if I use Spark can I use a different resource manager? - apache-spark

What is the use case for HDFS alone, without YARN, and what is the use case with it? Should I use MapReduce, or can I only use Spark? Also, if I use Spark, can I plug in a different resource manager instead of YARN on the same system? Is that the optimal solution, and how do I decide between them based on the use case?
Sorry, I don't have a specific use case!

Hadoop Distributed File System
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms.
HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.
Take Away
HDFS is based on a master-slave architecture, with the Name Node (NN) as the master and the Data Nodes (DN) as the slaves.
The Name Node stores only the metadata about the files; the actual data is stored on the Data Nodes.
Both the Name Node and the Data Nodes are processes, not any super fancy hardware.
The Data Node uses the underlying OS file system to save the data.
You need an HDFS client to interact with HDFS. The HDFS client always talks to the Name Node for metadata and subsequently talks to the Data Nodes to read/write data. No data I/O happens through the Name Node.
Because HDFS clients never send data to the Name Node, the Name Node never becomes a bottleneck for any data I/O in the cluster.
If the HDFS client has the "short-circuit" feature enabled and is running on a node hosting a Data Node, it can read the file directly from that Data Node, making the complete read/write local.
To make it even simpler, imagine the HDFS client is a web client and HDFS as a whole is a web service with predefined operations such as GET, PUT, COPYFROMLOCAL, etc.
How is a 400 MB file saved on HDFS with an HDFS block size of 100 MB? It is split into 4 blocks of 100 MB each, and those blocks (plus their replicas) are spread across the Data Nodes; the sketch below makes the arithmetic explicit.
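A minimal sketch of that block arithmetic in plain Python; the file and block sizes are simply the numbers from the example above:

    import math

    # Numbers from the example above: a 400 MB file, 100 MB HDFS block size.
    file_size_mb = 400
    block_size_mb = 100

    num_blocks = math.ceil(file_size_mb / block_size_mb)
    print(f"{file_size_mb} MB file -> {num_blocks} blocks of up to {block_size_mb} MB each")
    # Each block is also replicated (3 copies by default) across different Data Nodes;
    # the Name Node only records which Data Nodes hold which block.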
YARN (Yet Another Resource Negotiator)
(Does "Yet Another" ring a bell? Think of 'Yet Another Hierarchically Organized Oracle', i.e. YAHOO.)
YARN is essentially a system for managing distributed applications. It consists of a central Resource Manager (RM), which arbitrates all available cluster resources, and a per-node Node Manager (NM), which takes direction from the Resource Manager. The Node Manager is responsible for managing the available resources on a single node.
Take Away
YARN is based on a master-slave architecture, with the Resource Manager as the master and the Node Managers as the slaves.
The Resource Manager keeps the metadata about which jobs are running on which Node Manager and how much memory and CPU they consume, and hence has a holistic view of the total CPU and RAM consumption of the whole cluster.
Jobs run on the Node Managers and never execute on the Resource Manager, so the RM never becomes a bottleneck for job execution. Both the RM and NM are processes, not some fancy hardware.
A container is a logical abstraction of CPU and RAM.
YARN (Yet Another Resource Negotiator) schedules containers (CPU and RAM) over the whole cluster. Hence, an end user who needs CPU and RAM in the cluster has to interact with YARN.
When requesting CPU and RAM, you can specify the host on which you need it.
To interact with YARN you need to use a YARN client, either directly or through a framework (such as MapReduce or Spark) that acts as one; a sketch follows below.
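As an illustration of asking YARN for CPU and RAM, here is a hedged PySpark sketch. It assumes Spark is submitted against a YARN cluster, and the resource numbers are arbitrary examples rather than recommendations:

    from pyspark.sql import SparkSession

    # Sketch of a Spark application acting as a YARN client: it asks the
    # Resource Manager for containers with the CPU and RAM it needs.
    spark = (
        SparkSession.builder
        .appName("yarn-resource-request-sketch")
        .master("yarn")                              # let YARN negotiate the resources
        .config("spark.executor.instances", "4")     # ask for 4 executor containers
        .config("spark.executor.memory", "2g")       # RAM per container
        .config("spark.executor.cores", "2")         # CPU cores per container
        .getOrCreate()
    )

    # Each executor runs inside a YARN container on some Node Manager; the
    # Resource Manager decides which hosts can satisfy the request.
    spark.stop()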
How HDFS and YARN work in TANDEM
The Name Node and Resource Manager processes are hosted on two different hosts, as they hold key metadata.
The Data Node and Node Manager processes are co-located on the same hosts.
A file is saved onto HDFS (Data Nodes), and to access it in a distributed way one can write a YARN application (MR2, Spark, Distributed Shell, a Slider application) using the YARN client, and use the HDFS client to read the data.
The distributed application can fetch the file's block locations (metadata from the Name Node) and ask the Resource Manager (YARN) to provide containers on the hosts that hold the file blocks.
Do remember the short-circuit optimization provided by HDFS: if the distributed job gets a container on a host that holds the file block and reads it there, the read is local and not over the network.
The same 400 MB file that would take 4 seconds to read sequentially (at 100 MB/sec) can be read in about 1 second, because the distributed processes run in parallel in different YARN containers (Node Managers), each reading its own 100 MB block at 100 MB/sec.

HDFS can be used as a fault-tolerant filesystem without YARN, yes. But so can Ceph, MinIO, GlusterFS, etc., each of which can work with Spark.
That addresses storage. For processing, you can only configure one scheduler per Spark job, but the same code should be able to run in any environment, whether that is YARN, Spark Standalone, Mesos, or Kubernetes; ideally, you would not install these together.
Therefore, neither HDFS nor YARN is required for Spark.
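To make the "one scheduler per job, same code everywhere" point concrete, here is a hedged PySpark sketch; the master URLs, host names, and paths are placeholders for whatever environment you actually run against:

    from pyspark.sql import SparkSession

    # The same application code targets different cluster managers by changing
    # only the master URL (plus any deployment-specific configs).
    master = "yarn"                          # Hadoop YARN
    # master = "spark://master-host:7077"    # Spark Standalone
    # master = "k8s://https://k8s-api:6443"  # Kubernetes
    # master = "local[*]"                    # single machine, for testing

    spark = SparkSession.builder.appName("portable-app").master(master).getOrCreate()

    # Storage is just as pluggable: the URL scheme picks the filesystem, provided
    # the matching connector is on the classpath.
    df = spark.read.text("hdfs://namenode:8020/data/input.txt")  # or s3a://..., gs://...
    print(df.count())
    spark.stop()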

Related

Can we create a Hadoop Cluster on Dataproc with 0%-2% of HDFS?

Is it possible to create a Hadoop cluster on Dataproc with no or very minimal HDFS space by setting dfs.datanode.du.reserved to about 95% or 100% of the total node size? The plan is to use GCS for all persistent storage, while the local file system will primarily be used for Spark's shuffle data. Some of the Hive queries may still need scratch space on HDFS, which explains the need for minimal HDFS.
I did create a cluster with a 10-90 split and did not notice any issues with my test jobs.
Could there be stability issues with Dataproc if this approach is taken?
Also, are there concerns with removing the Data Node daemon from Dataproc's worker nodes, thereby using the primary workers as compute-only nodes? The rationale is that Dataproc currently doesn't allow a mix of preemptible and non-preemptible secondary workers, so I want to check whether we can repurpose primary workers as compute-only non-PVM nodes while the secondary workers are compute-only PVM nodes.
I am starting a GCP project; I am well-versed in Azure, and less so in AWS, but know enough there, having done a DDD setup.
What you describe is similar to an AWS setup, and I looked at this recently: https://jayendrapatil.com/google-cloud-dataproc/
My impression is that you can run without HDFS here as well - 0%. The key point is that performance for a suite of jobs will, as with AWS and Azure, benefit from writing to and reading from ephemeral HDFS, as it is faster than Google Cloud Storage. I cannot see stability issues; I can use Spark now without HDFS if I really want.
On the 2nd question, stick to what they have engineered. Why try to force things? On AWS we lived with the limitations on scaling down with Spark.
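As a rough sketch of the "GCS for all persistent storage" pattern on Dataproc, assuming the GCS connector that Dataproc ships with; the bucket names, paths, and column name are hypothetical:

    from pyspark.sql import SparkSession

    # Sketch of a Dataproc job that keeps persistent data in GCS and only uses
    # local disk / minimal HDFS for shuffle and scratch space. Names are hypothetical.
    spark = SparkSession.builder.appName("gcs-only-storage").getOrCreate()

    df = spark.read.parquet("gs://example-bucket/raw/events/")    # read from GCS
    counts = df.groupBy("event_type").count()                     # shuffle hits local disk
    counts.write.mode("overwrite").parquet("gs://example-bucket/curated/event_counts/")

    spark.stop()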

Do nodes in a Spark cluster share the same storage?

I am a newbie to Spark. I am using Azure Databricks and I am writing Python code with PySpark. There is one particular topic which is confusing me:
Do nodes have separate storage memory (I don't mean RAM/cache)? Or do they all share the same storage? If they share the same storage, can two different applications running in different Spark Contexts exchange data?
I don't understand why we sometimes refer to the storage by dbfs:/tmp/..., and other times by /dbfs/tmp/... For example, if I am using the dbutils package from Databricks, we use something like dbfs:/tmp/... to refer to a directory in the file system. However, if I'm using regular Python code, I say /dbfs/tmp/.
Your help is much appreciated!!
Each node has its own separate RAM and cache. For example, take a cluster with 4 GB and 3 nodes: when you deploy your Spark application, it will run worker processes, depending on the cluster configuration and query requirements, and it will create those processes on separate nodes or on the same node. These nodes' memories are not shared with each other during the life of the application.
This is more of a Hadoop resource-sharing question, and you can find more information under YARN resource management. This is a very brief overview:
https://databricks.com/session/resource-management-and-spark-as-a-first-class-data-processing-framework-on-hadoop
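On the dbfs:/tmp/... versus /dbfs/tmp/... part of the question: a minimal sketch, assuming it runs in a Databricks notebook (where dbutils is predefined) and that DBFS is FUSE-mounted at /dbfs; the file name is hypothetical:

    # Spark APIs and dbutils address DBFS through the dbfs:/ URI scheme...
    dbutils.fs.put("dbfs:/tmp/example.txt", "hello from dbfs", True)  # True = overwrite
    print(dbutils.fs.head("dbfs:/tmp/example.txt"))

    # ...while plain Python sees the same storage through the local FUSE mount /dbfs,
    # so ordinary file APIs operate on the very same object.
    with open("/dbfs/tmp/example.txt") as f:
        print(f.read())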

How many connections will be built between Spark and HDFS when sc.textFile("hdfs://.....") is called

How many connections will be built between Spark and HDFS when sc.textFile("hdfs://.....") is called? The file on HDFS is very large (100 GB).
Actually, the main idea behind distributed systems, and of course what is designed and implemented in Hadoop and Spark, is to send the processing to the data. In other words, imagine there is some data located on HDFS data nodes in our cluster, and we have a job that uses that data on the same workers. On each machine you would have a data node that is a Spark worker at the same time, and it may host some other processes like an HBase region server too. When an executor is executing one of its scheduled tasks, it retrieves the data it needs from the underlying data node. So for each individual task the executor retrieves that task's data, and you can describe this as one connection to HDFS, to its local data node, per task.
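A rough way to see how many tasks (and therefore how many per-task HDFS reads) a textFile call produces; the path is a placeholder, and the partition count depends on the HDFS block size and the optional minPartitions argument:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("textfile-partition-count").getOrCreate()
    sc = spark.sparkContext

    # Placeholder path. For a 100 GB file with a 128 MB block size you would expect
    # roughly 100 * 1024 / 128 = 800 partitions, i.e. about 800 tasks.
    rdd = sc.textFile("hdfs://namenode:8020/data/very-large-file.txt")
    print(rdd.getNumPartitions())

    # Each partition becomes one task, and each task opens its own stream to the
    # (ideally local) Data Node holding its block; there is no single connection
    # for the whole 100 GB file.
    spark.stop()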

How YARN knows data locality in Apache Spark in cluster mode

Assume there is a Spark job that is going to read a file named records.txt from HDFS, do some transformations, and perform one action (write the processed output into HDFS). The job will be submitted in YARN cluster mode.
Assume also that records.txt is a 128 MB file and one of its replicated HDFS blocks is on NODE 1.
Let's say YARN allocates an executor inside NODE 1.
How does YARN allocate an executor exactly on a node where the input data is located?
Who tells YARN that one of the replicated HDFS blocks of records.txt is available on NODE 1?
How is the data locality found by the Spark application? Is it done by the Driver, which runs inside the Application Master?
Does YARN know about the data locality?
The fundamental question here is:
Does YARN know about the data locality?
YARN "knows" what the application tells it, and it understands the structure (topology) of the cluster. When an application makes a resource request, it can include specific locality constraints, which may or may not be satisfied when resources are allocated.
If the constraints cannot be satisfied, YARN (or any other cluster manager) will attempt to provide the best alternative match, based on its knowledge of the cluster topology.
So how does the application "know"?
If the application uses an input source (a file system or otherwise) that supports some form of data locality, it can query the corresponding catalog (the Name Node in the case of HDFS) to get the locations of the blocks of data it wants to access.
In a broader sense, a Spark RDD can define preferredLocations, depending on the specific RDD implementation, which can later be translated into resource constraints for the cluster manager (not necessarily YARN).
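As an illustration of what "querying the catalog" looks like for HDFS, here is a hedged PySpark sketch that asks the Name Node for a file's block locations via the Hadoop FileSystem API, reached through Spark's JVM gateway. The path is a placeholder, and the underscore-prefixed handles are internal PySpark plumbing, so treat this as illustrative only:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("block-locations-sketch").getOrCreate()
    sc = spark.sparkContext

    # Reach the Hadoop FileSystem API through the JVM gateway (internal API).
    jvm = sc._jvm
    hadoop_conf = sc._jsc.hadoopConfiguration()
    path = jvm.org.apache.hadoop.fs.Path("hdfs://namenode:8020/data/records.txt")
    fs = path.getFileSystem(hadoop_conf)

    # Ask the Name Node which hosts store each block of the file.
    status = fs.getFileStatus(path)
    for block in fs.getFileBlockLocations(status, 0, status.getLen()):
        print(block.getOffset(), block.getLength(), list(block.getHosts()))

    # Spark performs an equivalent lookup when it builds a HadoopRDD and exposes the
    # result as preferredLocations, which the scheduler turns into locality hints
    # for the cluster manager.
    spark.stop()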

Spark RDD access restrictions and location within the cluster

I have a question regarding RDD access control.
There is data which has to be kept only on a given server (or a list of them); no raw data is allowed to leave it. The data can be processed by some map function, and only after that can it be transferred further.
Are there any features in Spark or in supported cluster management solutions (e.g. Mesos)?
A HadoopRDD (used by sc.textFile, for example) has an affinity to be located on the machine that has the file data. (See HadoopRDD.getPreferredLocations.) map is then performed on the same machine.
But this does not guarantee that the raw data will not leave the machine. If the Spark worker on the machine dies, for example, then another worker will load it from a different machine.
I think the safe option is to run one Spark cluster (or other processing system) on the "secure" machines, perform the map step in this cluster, and write out the result to the HDFS (or other storage system) running on the "unsecure" machines. Then a separate Spark cluster running on the "unsecure" machines can process the data.
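A hedged sketch of that two-cluster split, with hypothetical host names, paths, and a placeholder anonymize function: the job runs on the Spark cluster confined to the "secure" machines, maps the raw records, and writes only the derived output to storage that the "unsecure" side can read:

    from pyspark.sql import SparkSession

    # Runs on the Spark cluster that lives entirely on the "secure" machines.
    # Host names, paths, and the anonymize() logic below are placeholders.
    spark = SparkSession.builder.appName("secure-map-then-export").getOrCreate()
    sc = spark.sparkContext

    def anonymize(line):
        # Placeholder transformation: whatever mapping makes a record safe to share.
        return line.split(",")[0]

    raw = sc.textFile("hdfs://secure-namenode:8020/restricted/records.csv")
    safe = raw.map(anonymize)

    # Only the mapped (non-raw) records leave the secure cluster, written to storage
    # that the separate "unsecure" processing cluster can then read.
    safe.saveAsTextFile("hdfs://shared-namenode:8020/exported/records-anonymized")
    spark.stop()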
