Do nodes in Spark Cluster share the same storage? - apache-spark

I am a newbie to Spark. I am using Azure Databricks and writing Python code with PySpark. There is one particular topic that is confusing me:
Do nodes have separate storage (I don't mean RAM/cache), or do they all share the same storage? If they share the same storage, can two different applications running in different Spark contexts exchange data?
I also don't understand why we sometimes refer to the storage as dbfs:/tmp/... and other times as /dbfs/tmp/... For example, if I am using the dbutils package from Databricks, I use something like dbfs:/tmp/... to refer to a directory in the file system. However, if I'm using regular Python code, I say /dbfs/tmp/.
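For example, here is roughly what I mean (the paths are made up; dbutils and spark are the objects a Databricks notebook provides):

```python
# Spark and dbutils APIs take the dbfs:/ URI scheme:
dbutils.fs.ls("dbfs:/tmp/my_dir")                  # list a DBFS directory
df = spark.read.csv("dbfs:/tmp/my_dir/data.csv")   # read it with Spark

# Plain Python file APIs go through the local /dbfs mount instead:
with open("/dbfs/tmp/my_dir/data.csv") as f:
    print(f.readline())
```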
Your help is much appreciated!!

Each node has separate RAM and caching. For example, say you have a cluster of 3 nodes with 4 GB each. When you deploy your Spark application, it starts worker processes (JVMs) on separate nodes or on the same node, depending on the cluster configuration and query requirements. These node memories are not shared with each other during the life of the application.
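To make that concrete, here is a minimal sketch, with illustrative sizing values, of how an application fixes its per-executor memory; each executor owns its configured heap on whichever node it lands on, and executors never share that memory:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("per-node-memory-demo")
    .config("spark.executor.memory", "2g")    # RAM reserved per executor
    .config("spark.executor.cores", "2")      # cores per executor
    .config("spark.executor.instances", "3")  # e.g. one executor per node
    .getOrCreate()
)
```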
This is really more of a Hadoop resource-sharing question, and you can find more information under YARN resource management. This is a very brief overview:
https://databricks.com/session/resource-management-and-spark-as-a-first-class-data-processing-framework-on-hadoop

Related

Can we create a Hadoop Cluster on Dataproc with 0%-2% of HDFS?

Is it possible to create a Hadoop cluster on Dataproc with no or very minimal HDFS space by setting dfs.datanode.du.reserved to about 95% or 100% of the total node size? The plan is to use GCS for all persistent storage, while the local file system will primarily be used for Spark's shuffle data. Some of the Hive queries may still need scratch space on HDFS, which explains the need for minimal HDFS.
I did create a cluster with a 10-90 split and did not notice any issues with my test jobs.
Could there be stability issues with Dataproc if this approach is taken?
Also, are there concerns with removing the Data Node daemon from Dataproc's worker nodes, thereby using the primary workers as compute-only nodes? The rationale is that Dataproc currently doesn't allow a mix of preemptible and non-preemptible secondary workers, so I want to check whether we can repurpose primary workers as compute-only non-PVM nodes while the secondary workers are compute-only PVM nodes.
I am starting a GCP project and am well-versed in Azure, less so in AWS, but I know enough there, having done a DDD setup.
What you describe is similar to the AWS setup, and I looked recently at this: https://jayendrapatil.com/google-cloud-dataproc/
My impression is you can run without HDFS here as well, i.e. 0%. The key point is that performance for a suite of jobs will, as on AWS and Azure, benefit from writing to and reading from ephemeral HDFS, as it is faster than Google Cloud Storage. I cannot see stability issues; I can already use Spark without HDFS if I really want to.
On the second question, stick to what they have engineered. Why try to force things? On AWS we lived with the limitations on scaling down with Spark.
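As a sketch of the GCS-only pattern (bucket, paths, and column name are hypothetical), persistent reads and writes can bypass HDFS entirely:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-only").getOrCreate()

# Dataproc ships with the GCS connector, so gs:// paths work out of the box;
# HDFS is then only touched for shuffle and scratch space.
df = spark.read.parquet("gs://my-bucket/events/")
df.groupBy("event_type").count() \
  .write.mode("overwrite").parquet("gs://my-bucket/counts/")
```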

HDFS vs HDFS with YARN, and if I use spark can I put new resource management?

What is the use case for HDFS alone, without YARN, and with it? Should I use MapReduce, or can I only use Spark? Also, if I use Spark, can I use a different resource manager for Spark instead of YARN on the same system? And which is the optimal solution, and how do I decide between them based on the use case?
Sorry, I don't have a specific use case!
Hadoop Distributed File System
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms.
HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.
Take Away
HDFS is based on a master-slave architecture, with the Name Node (NN) being the master and the Data Nodes (DN) being the slaves.
The Name Node stores only the metadata about the files; the actual data is stored in the Data Nodes.
Both the Name Node and Data Nodes are processes, not any super fancy hardware.
A Data Node uses the underlying OS file system to save the data.
You need to use an HDFS client to interact with HDFS. The HDFS client always talks to the Name Node for metadata and subsequently talks to the Data Nodes to read/write data; no data I/O happens through the Name Node.
HDFS clients never send data to the Name Node, hence the Name Node never becomes a bottleneck for any data I/O in the cluster.
The HDFS client has a "short-circuit" feature: if the client is running on a node hosting a Data Node, it can read the file directly from that Data Node, making the complete read/write local.
To make it even simpler, imagine the HDFS client is a web client and HDFS as a whole is a web service with predefined operations like GET, PUT, COPYFROMLOCAL, etc.
How is a 400 MB file saved on HDFS with an HDFS block size of 100 MB? It is split into four 100 MB blocks, and each block is stored (and replicated) across the Data Nodes; see the sketch below.
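As a minimal sketch of that client interaction (host, port, and path are hypothetical), reading such a file through Spark's built-in HDFS client looks like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read").getOrCreate()

# The HDFS client first asks the Name Node where the four blocks live,
# then streams each block directly from the Data Nodes that hold it.
lines = spark.read.text("hdfs://namenode-host:8020/tmp/400mb_file.txt")
print(lines.count())
```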
YARN (Yet Another Resource Negotiator)
Does it ring a bell? "Yet Another Hierarchically Organized Oracle", i.e. YAHOO.
YARN is essentially a system for managing distributed applications. It consists of a central Resource Manager (RM), which arbitrates all available cluster resources, and a per-node Node Manager (NM), which takes direction from the Resource Manager. The Node Manager is responsible for managing the available resources on a single node.
Take Away
YARN is based on a master-slave architecture, with the Resource Manager being the master and the Node Managers being the slaves.
The Resource Manager keeps metadata about which jobs are running on which Node Manager and how much memory and CPU they consume, and hence has a holistic view of the total CPU and RAM consumption of the whole cluster.
Jobs run on the Node Managers and never execute on the Resource Manager, hence the RM never becomes a bottleneck for job execution. Both the RM and NM are processes, not some fancy hardware.
A container is a logical abstraction for CPU and RAM.
YARN schedules containers (CPU and RAM) over the whole cluster. Hence, if an end user needs CPU and RAM in the cluster, they need to interact with YARN.
While requesting CPU and RAM, you can specify the host on which you need it.
To interact with YARN you need to use a YARN client, which submits your application to the Resource Manager.
How HDFS and YARN work in tandem
The Name Node and Resource Manager processes are hosted on two different hosts, as they hold key metadata.
The Data Node and Node Manager processes are co-located on the same hosts.
A file is saved onto HDFS (Data Nodes). To access a file in a distributed way, one can write a YARN application (MR2, Spark, Distributed Shell, Slider) using the YARN client, and use the HDFS client to read the data.
The distributed application can fetch the file locations (metadata from the Name Node) and ask the Resource Manager (YARN) to provide containers on the hosts that hold the file blocks.
Do remember the short-circuit optimization provided by HDFS: if the distributed job gets a container on a host that holds a file block and tries to read it, the read will be local and not over the network.
The same 400 MB file that would take 4 seconds to read sequentially (at 100 MB/sec) can be read in 1 second, because four distributed processes run in parallel in different YARN containers (Node Managers), each reading its own 100 MB block at 100 MB/sec.
HDFS can be used as a fault-tolerant filesystem without YARN, yes. But so can Ceph, MinIO, GlusterFS, etc., each of which can work with Spark.
That addresses storage. For processing, you can only configure one scheduler per Spark job, but the same code should be able to run in any environment, whether that is YARN, Spark Standalone, Mesos, or Kubernetes; you ideally would not install these together.
Therefore, neither HDFS nor YARN is required for Spark.
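To illustrate that portability, a minimal sketch (master URLs and hosts are placeholders); only the master URL changes between cluster managers:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("portable-app")
    # .master("yarn")                        # Hadoop YARN
    # .master("spark://master-host:7077")    # Spark standalone
    # .master("k8s://https://k8s-api:6443")  # Kubernetes
    .master("local[*]")                      # local testing, no cluster manager
    .getOrCreate()
)
print(spark.range(10).count())  # same application code everywhere
```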

Is it possible to know the resources used by a specific Spark job?

I'm exploring the idea of a multi-tenant Spark cluster. The cluster executes jobs on demand for a specific tenant.
Is it possible to "know" the specific resources used by a specific job (for billing reasons)? E.g., if a job causes several nodes in Kubernetes to be allocated automatically, is it possible to track which Spark jobs (and, ultimately, which tenant) initiated these resource allocations? Or are jobs always spread evenly over the allocated resources?
I tried to find information on the Apache Spark site and elsewhere on the internet, without success.
See https://spark.apache.org/docs/latest/monitoring.html
You can save data from the Spark History Server as JSON and then write your own resource calculation on top of it.
It is a Spark application you mean, by the way.
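For example, a rough sketch against the History Server's REST API (host and port are the defaults; the per-application aggregation is a made-up starting point for billing):

```python
import requests

# Default Spark History Server endpoint; adjust for your deployment.
base = "http://history-server:18080/api/v1"

for app in requests.get(f"{base}/applications").json():
    executors = requests.get(f"{base}/applications/{app['id']}/executors").json()
    total_cores = sum(e.get("totalCores", 0) for e in executors)
    print(app["id"], app["name"], "cores:", total_cores)
```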

While creating an Azure HDInsight cluster for Starburst Presto, can I create a Spark cluster?

While creating infrastructure for big data, I want to use Azure HDInsight with a Presto installation. Azure HDInsight comes in different flavors like Hadoop, Spark, etc. The documentation recommends using a Hadoop cluster, but I want to use the Spark one.
Is it possible to use a Spark cluster with Starburst's Presto distribution?
It looks like you want to use both Presto and Spark at the same time.
If you run them on a single cluster, you would need to configure them appropriately to make sure the JVMs for the different processes can co-exist. This is possible but hard to do in practice (you need to know how the JVM allocates memory beyond the -Xmx setting), so it's definitely not recommended.
I can imagine that in some on-premises installations, where provisioning new hardware is hard, you could want to colocate services on one cluster. In the cloud, however, it's much more convenient to provision two separate clusters, each appropriately sized for your particular needs and workload. For example, you could have one cluster with Presto for interactive analytics, dashboarding, and ad-hoc queries, and another one with Spark for your machine learning or ETL workloads.
Please refer to the Starburst Presto on Azure documentation for detailed configuration instructions.

How does data locality work with OpenStack Swift on IBM Bluemix?

I'm currently playing around with the Apache Spark service in IBM Bluemix. Since the IBM Cloud relies on OpenStack Swift as the data storage for this service, I'm wondering whether any data locality is (at least) possible with that architecture.
If I'm right, with HDFS the Spark driver asks the HDFS Name Node about the Data Nodes containing the various blocks of a file and then schedules the work onto the Spark workers.
So I've checked the Swift API: there is a Range parameter, which would allow a Spark worker to read only local blocks. But how can the Spark driver find out these ranges?
Any ideas?
This is the disaggregation of compute and storage: the Spark compute nodes are not shared at all with the Swift cluster's storage nodes. This confers benefits on scaling compute separately from storage, and vice versa. But in this model you cannot have data locality, by definition. Roughly, how this works is that each Spark executor pulls its own range of blocks of the object from the Swift cluster, so that no executor needs to pull in all the object data just to operate on its own portion, which would be inefficient. But the blocks are still pulled from the remote Swift cluster, so they are not local. The only question is how long it takes to pull the blocks into each executor, and whether that slows you down. In the case of the Bluemix Apache Spark service and the Bluemix or Softlayer Object Storage service, there is low latency and a fast network between them.
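As a sketch of that range-read mechanism (URL, token, and byte range are placeholders), each executor effectively issues a standard HTTP Range request against Swift:

```python
import requests

url = "https://swift.example.com/v1/AUTH_account/container/big_object"
headers = {
    "X-Auth-Token": "placeholder-token",  # Swift auth token
    "Range": "bytes=0-104857599",         # this executor's 100 MB slice
}
chunk = requests.get(url, headers=headers).content
print(len(chunk))  # 104857600 bytes if the object is large enough
```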
re: "Since the IBM Cloud relies on OpenStack Swift as Data Storage for this service". There will be other data sources available to the spark service as the beta progresses, so it will not be 100% reliance.
