Spark shuffle service on local shared dir with Ceph on kubernetes - apache-spark

We run Spark 3.x on Kubernetes, and the executor pods share the same ReadWriteMany Ceph volume.
So all Spark workers write their shuffle data to the same volume (into different directories, I assume), which is accessible to every worker.
On the other hand, Spark shares shuffle data over the network.
How can I configure Spark to fetch shuffle data from other workers via the shared local volume rather than downloading it over TCP?
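For reference, the setup described above might be wired up roughly as follows with Spark's Kubernetes volume properties (a minimal sketch: the volume name shufflevol, the claim name ceph-rwx-claim, and the mount path are hypothetical, and these properties are normally passed at spark-submit time). Note this only directs local shuffle writes onto the Ceph mount; it does not by itself change how executors fetch remote shuffle blocks:

from pyspark.sql import SparkSession

# Sketch only: volume name, claim name, and mount path are hypothetical.
spark = (
    SparkSession.builder
    .appName("shared-shuffle-dir-sketch")
    # Mount the ReadWriteMany Ceph PVC into every executor pod.
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.shufflevol.options.claimName", "ceph-rwx-claim")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.shufflevol.mount.path", "/shared-shuffle")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.shufflevol.mount.readOnly", "false")
    # Point local (shuffle/spill) storage at the shared mount.
    .config("spark.local.dir", "/shared-shuffle")
    .getOrCreate()
)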

Related

Run spark cluster using an independent YARN (without using Hadoop's YARN)

I want to deploy a Spark cluster with the YARN cluster manager.
This Spark cluster needs to read data from an external HDFS filesystem belonging to an existing Hadoop ecosystem that also has its own YARN (however, I am not allowed to use that Hadoop cluster's YARN).
My questions are:
Is it possible to run the Spark cluster using an independent YARN, while still reading data from an outside HDFS filesystem?
If yes, is there any downside or performance penalty to this approach?
If no, can I run Spark as a standalone cluster, and will there be any performance issue?
Assume both the spark cluster and the Hadoop cluster are running in the same Data Center.
using an independent YARN, while still reading data from an outside HDFS filesystem
Yes. Point the yarn-site.xml at the YARN cluster you intend to use, and use the full FQDN to refer to external file locations, such as hdfs://namenode-external:8020/file/path (see the sketch after this answer).
any downside or performance penalty to this approach
Yes. All reads will be remote, rather than cluster-local. However, the performance degradation is effectively similar to reading from S3 or other remote locations.
can I run Spark as a standalone cluster
You could, or you could use Kubernetes if that's available, but both are pointless IMO if there's already a YARN cluster (with enough resources) available.
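To illustrate the first point, a minimal PySpark sketch (hostnames and paths are placeholders taken from the example above) of a job submitted to the independent YARN cluster while addressing the external HDFS by its full FQDN:

from pyspark.sql import SparkSession

# Submitted to the independent YARN cluster, e.g. via spark-submit --master yarn.
spark = SparkSession.builder.appName("external-hdfs-read-sketch").getOrCreate()

# Data on the external cluster is addressed by the namenode's full FQDN.
df = spark.read.parquet("hdfs://namenode-external:8020/file/path")
df.write.parquet("hdfs://namenode-external:8020/output/path")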

Yarn as resource manager in SPARK for linux cluster - inside Kubernetes and outside Kubernetes

If I am using a Kubernetes cluster to run Spark, then I am using the Kubernetes resource manager in Spark.
If I am using a Hadoop cluster to run Spark, then I am using the YARN resource manager in Spark.
But my question is, if I am spawning multiple Linux nodes in Kubernetes, and use one of the nodes as the Spark master and three others as workers, what resource manager should I use? Can I use YARN here?
Second question: in the case of any 4-node Linux Spark cluster (not in Kubernetes and not Hadoop, just simple connected Linux machines), even if I do not have HDFS, can I use YARN here as the resource manager? If not, then what resource manager should be used for Spark?
Thanks.
if I am spawning multiple Linux nodes in Kubernetes,
Then you'd obviously use Kubernetes, since it's available.
in the case of any 4-node Linux Spark cluster (not in Kubernetes and not Hadoop, just simple connected Linux machines), even if I do not have HDFS, can I use YARN here
You can, or you can use the Spark Standalone scheduler instead. However, Spark requires a shared filesystem for reading and writing data, so while you could attempt to use NFS, or S3/GCS, for this, HDFS is faster.
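If you go the Standalone route without HDFS, a rough sketch of pointing every worker at a shared object store such as S3 could look like this (the master URL, bucket, and credentials are placeholders, and the S3A connector jars from hadoop-aws are assumed to be on the classpath):

from pyspark.sql import SparkSession

# Sketch only: master URL, bucket, and credentials are placeholders.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")                 # Spark Standalone scheduler
    .appName("standalone-shared-fs-sketch")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .getOrCreate()
)

# All workers read and write through the shared object store instead of HDFS.
df = spark.read.csv("s3a://my-bucket/input/", header=True)
df.write.parquet("s3a://my-bucket/output/")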

Can Apache Spark worker nodes be different machines than HDFS data nodes?

I have an HDFS cluster (say it has 5 data nodes). If I want to set up a Spark cluster (say it has 3 worker nodes) that reads/writes data to the HDFS cluster, do I need to make sure the Spark worker nodes are on the same machines as the HDFS data nodes? IMO they can be different machines. But if the Spark worker nodes and HDFS data nodes are different machines, then when reading data from HDFS, the Spark worker nodes need to download data from different machines, which can lead to higher latency, whereas if they are on the same machines the latency can be reduced. Is my understanding correct?
In a bare-metal setup, and as originally postulated by MapReduce, the data locality principle applies as you state, and Spark would be installed on all the Data Nodes, implying each would also be a Worker Node. So the Spark Worker resides on the Data Node for rack awareness and data locality in HDFS. That said, there are now other storage managers, such as Kudu, and other NoSQL variants that do not use HDFS.
With cloud approaches to Hadoop, storage and compute are necessarily divorced, e.g. AWS EMR and EC2, et al. It cannot be otherwise, given elasticity of compute. This is not that bad, since Spark shuffles to the same Workers for related keys where possible, once the data has been fetched.
So for the cloud the question is not actually relevant anymore. For bare metal, Spark can be installed on different machines, but that would not make sense. In such a case I would install it on all the HDFS nodes, i.e. 5 rather than 3 as I understand it.

How to update spark configuration after resizing worker nodes in Cloud Dataproc

I have a Dataproc Spark cluster. Initially, the master and 2 worker nodes are of type n1-standard-4 (4 vCPU, 15.0 GB memory); then I resized all of them to n1-highmem-8 (8 vCPUs, 52 GB memory) via the web console.
I noticed that the two worker nodes are not being fully used. In particular, there are only 2 executors on the first worker node and 1 executor on the second worker node, with
spark.executor.cores 2
spark.executor.memory 4655m
in /usr/lib/spark/conf/spark-defaults.conf. I thought that with spark.dynamicAllocation.enabled true, the number of executors would be increased automatically.
Also, the information on the Dataproc page of the web console doesn't get updated automatically either. It seems that Dataproc still thinks that all nodes are n1-standard-4.
My questions are:
Why are there more executors on the first worker node than on the second?
Why aren't more executors added to each node?
Ideally, I want the whole cluster to be fully utilized; if the Spark configuration needs to be updated, how?
As you've found, a cluster's configuration is set when the cluster is first created and does not adjust to manual resizing.
To answer your questions:
The Spark ApplicationMaster takes a container in YARN on a worker node, usually the first worker if only a single spark application is running.
When a cluster is started, Dataproc attempts to fit two YARN containers per machine.
The YARN NodeManager configuration on each machine determines how much of the machine's resources should be dedicated to YARN. This can be changed on each VM under /etc/hadoop/conf/yarn-site.xml, followed by a sudo service hadoop-yarn-nodemanager restart. Once machines are advertising more resources to the ResourceManager, Spark can start more containers. After adding more resources to YARN, you may want to modify the size of containers requested by Spark by modifying spark.executor.memory and spark.executor.cores.
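As an illustration of that last step, once YARN advertises the larger machines you could ask for bigger executors per application, for example (the values below are only illustrative for an n1-highmem-8 worker, not Dataproc-computed defaults):

from pyspark.sql import SparkSession

# Sketch only: executor sizes are illustrative, not Dataproc defaults.
spark = (
    SparkSession.builder
    .appName("resized-executors-sketch")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "12g")
    .config("spark.dynamicAllocation.enabled", "true")   # let Spark scale the executor count
    .getOrCreate()
)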
Instead of resizing cluster nodes and manually editing configuration files afterwards, consider starting a new cluster with the new machine sizes and copying any data from your old cluster to the new cluster. In general, the simplest way to move data is to use Hadoop's built-in distcp utility. An example usage would be something along the lines of:
$ hadoop distcp hdfs:///some_directory hdfs://other-cluster-m:8020/
Or if you can use Cloud Storage:
$ hadoop distcp hdfs:///some_directory gs://<your_bucket>/some_directory
Alternatively, consider always storing data in Cloud Storage and treating each cluster as an ephemeral resource that can be torn down and recreated at any time. In general, any time you would save data to HDFS, you can also save it as:
gs://<your_bucket>/path/to/file
Saving to GCS has the nice benefit of allowing you to delete your cluster (and data in HDFS, on persistent disks) when not in use.
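For example, a small PySpark sketch of writing output to Cloud Storage instead of HDFS (the bucket name and data format are placeholders; on Dataproc the GCS connector is preinstalled, so gs:// paths can be used anywhere an hdfs:// path works):

from pyspark.sql import SparkSession

# Sketch only: bucket name and data format are illustrative.
spark = SparkSession.builder.appName("gcs-write-sketch").getOrCreate()

df = spark.read.parquet("hdfs:///some_directory")
df.write.parquet("gs://your_bucket/some_directory")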

What is hadoop (single and multi) nodes, spark-master and spark-worker?

I want to understand the following terms:
hadoop (single-node and multi-node)
spark master
spark worker
namenode
datanode
What I have understood so far is that the Spark master is the job executor and handles all the Spark workers, whereas Hadoop is the HDFS (where our data resides) from which the Spark workers read data according to the job given to them. Please correct me if I am wrong.
I also want to understand the role of the namenode and the datanodes. I know the role of the namenode (it holds the metadata of all the datanodes; preferably there should be only one, but there could be two), and that there can be multiple datanodes holding the data.
Are datanodes the same as Hadoop nodes?
SPARK Architecture:
Spark uses a master/worker architecture. There is a driver that talks to a single coordinator called master that manages workers in which executors run.
The driver and the executors run in their own Java processes. You can run them all on the same machine (horizontal cluster), on separate machines (vertical cluster), or in a mixed machine configuration.
Nodes are nothing but the physical machines.
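To make the roles concrete, a minimal sketch of a driver connecting to a standalone master, which in turn schedules executors on the workers (the master URL is a placeholder):

from pyspark.sql import SparkSession

# This code runs in the driver; the master allocates resources on the workers,
# where the executor processes actually run the tasks.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")   # the coordinator (master)
    .appName("architecture-sketch")
    .config("spark.cores.max", "6")        # cap total executor cores across the workers
    .getOrCreate()
)

print(spark.sparkContext.parallelize(range(100)).sum())   # tasks run on the executors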
Hadoop NameNode and DataNode:
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Yes, DataNodes are the slave nodes in a Hadoop cluster.
Please refer to the documentation for more details.
Hadoop single-node: a Hadoop cluster with 1 NameNode (master) and 1 DataNode (slave). The NameNode holds all the metadata and assigns work to the slave DataNodes, where data is stored and processing is done.
Hadoop multi-node: a Hadoop cluster with 1 NameNode (master) and n DataNodes (slaves).
spark master: analogous to the NameNode in HDFS.
spark worker: analogous to a DataNode, but a Spark worker is only meant for processing, not for storing data.
To put things in (simple) context: if there is a cluster with 1 NameNode and 2 DataNodes (1 GB each), a 2 GB file will be split and stored across the DataNodes. Similarly, a Spark job will be split to process this data on the individual DataNodes (workers) in parallel.
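As a rough sketch of that split (the file path is a placeholder), Spark's input partitions follow the HDFS blocks, so a 2 GB file read with the default 128 MB partition size yields roughly 16 parallel tasks:

from pyspark.sql import SparkSession

# Sketch only: the file path is a placeholder.
spark = SparkSession.builder.appName("split-sketch").getOrCreate()

df = spark.read.text("hdfs:///data/2gb_file.txt")
print(df.rdd.getNumPartitions())   # roughly 16 partitions at 128 MB each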

Resources