YARN as resource manager in Spark for a Linux cluster - inside Kubernetes and outside Kubernetes

If I am using a Kubernetes cluster to run Spark, then I am using the Kubernetes resource manager in Spark.
If I am using a Hadoop cluster to run Spark, then I am using the YARN resource manager in Spark.
But my question is: if I am spawning multiple Linux nodes in Kubernetes, and use one of the nodes as the Spark master and three others as workers, what resource manager should I use? Can I use YARN here?
Second question: in the case of any 4-node Linux Spark cluster (not in Kubernetes and not Hadoop, just simple connected Linux machines), even if I do not have HDFS, can I use YARN as the resource manager? If not, what resource manager should be used for Spark?
Thanks.

if I am spawning multiple Linux nodes in Kubernetes,
Then you'd use Kubernetes, since it's available.
in the case of any 4-node Linux Spark cluster (not in Kubernetes and not Hadoop, just simple connected Linux machines), even if I do not have HDFS, can I use YARN here
You can, or you can use the Spark Standalone scheduler instead. However, Spark requires a shared filesystem for reading and writing data, so while you could use NFS or S3/GCS for this, HDFS is faster.
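For illustration, a rough sketch of what submitting to each of those schedulers might look like (the host names, image name, bucket, and jar paths below are placeholders, not values from the question):

# Kubernetes as the resource manager: executors run as pods
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.executor.instances=3 \
  local:///opt/app/your-app.jar

# Spark Standalone on plain Linux machines: start a master and workers first,
# then submit against the master; data lives on shared storage (NFS, S3/GCS) instead of HDFS
spark-submit \
  --master spark://<master-host>:7077 \
  path/to/your-app.jar s3a://<bucket>/input/path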

Related

Integrating Hadoop YARN with Mesos infra

I have created an HDFS cluster. I have to configure YARN so as to allow the YARN application master to create containers for job processing on the Mesos cluster on demand.
How can I integrate the HDFS cluster with the Mesos infra so that it can create containers on Mesos?
I need to figure out a way to run the containers created by the application master on resources apart from the YARN cluster (a client node, an edge node, or the resources spun up through the Mesos infra). Basically, I have to create an on-demand, compute-only cluster which can run the YARN apps once YARN is used up.
Mesos was created as a more generic version of YARN; they're not really intended to be used together (YARN apps cannot be deployed to Mesos). Spark apps are about the only process in the whole Hadoop ecosystem that can be deployed (independently) to both.
Worth pointing out that Mesos was moved to the Apache Attic (edit: and quickly moved out, it seems, but there have been no releases since then). In other words, it's seen as deprecated. With a bit of configuration, YARN can run plain Docker containers, if that's what you're using Mesos for. Apache Twill was a library for creating distributed applications on top of YARN, but that's also in the Apache Attic (and stayed).
You also don't need any special configuration to communicate with HDFS from Mesos applications, only the hadoop-client dependency and configured core-site.xml and hdfs-site.xml files.
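As a hedged sketch of that last point (host names, conf path, and jar path are placeholders): with the HDFS client configs available on the machine, a Spark job deployed to Mesos can read hdfs:// paths directly:

export HADOOP_CONF_DIR=/etc/hadoop/conf     # directory containing core-site.xml and hdfs-site.xml
spark-submit \
  --master mesos://<mesos-master>:5050 \
  path/to/your-app.jar hdfs://<namenode>:8020/input/path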

Run a Spark cluster using an independent YARN (without using Hadoop's YARN)

I want to deploy a Spark cluster with the YARN cluster manager.
This Spark cluster needs to read data from an external HDFS filesystem belonging to an existing Hadoop ecosystem that also has its own YARN (however, I am not allowed to use that Hadoop cluster's YARN).
My questions are:
Is it possible to run a Spark cluster using an independent YARN, while still reading data from an outside HDFS filesystem?
If yes, is there any downside or performance penalty to this approach?
If no, can I run Spark as a standalone cluster, and will there be any performance issue?
Assume both the Spark cluster and the Hadoop cluster are running in the same data center.
using an independent YARN, while still reading data from an outside HDFS filesystem
Yes. Configure yarn-site.xml to point at the necessary cluster and use the full FQDN to refer to external file locations, such as hdfs://namenode-external:8020/file/path
any downside or performance penalty to this approach
Yes. All reads will be remote rather than cluster-local. This is effectively a similar performance degradation to reading from S3 or other remote locations, however.
can I run Spark as a standalone cluster
You could, or you could use Kubernetes if that's available, but both are pointless IMO if there's already a YARN cluster (with enough resources) available.
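A minimal sketch of the first point, assuming HADOOP_CONF_DIR points at the independent YARN cluster's configs (the conf path and jar path are placeholders; the HDFS URI is the example from the answer above):

export HADOOP_CONF_DIR=/path/to/independent-yarn/conf    # yarn-site.xml, core-site.xml for your own YARN
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  path/to/your-app.jar hdfs://namenode-external:8020/file/path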

How to run a Spark Standalone master on Kubernetes that will use the Kubernetes Cluster Manager to start workers

I have an application that currently uses standalone mode locally to use Spark functionality via the SparkContext. We are not using spark-submit to upload our jobs; we are running our application in a container on Kubernetes, so we would like to take advantage of the dynamic scheduling that Kubernetes provides to run the jobs.
We started out looking for a Helm chart to create a standalone cluster running on Kubernetes, similar to how you would have run a standalone cluster on machines (VMs or actual machines) a few years ago, and came across the following:
https://github.com/helm/charts/tree/master/stable/spark
Issues:
Very old versions of Spark
Not using the containers provided by Spark
This setup wastes a lot of resources, since you need large worker nodes reserved and running all the time regardless of your actual need
Next we started looking at the spark-operator approach here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Issues:
Doesn't support the way we interact with Spark; it takes the approach that all apps are standalone apps that are pushed to the cluster to run
No long-standing master that allows us to take advantage of cached resources in the cluster
Along this journey we discovered that Spark now supports a Kubernetes cluster manager (similar to the way it does with YARN and Mesos), so we are looking at whether this might be the best approach, but it still does not provide a standalone master that would allow for the in-memory caching. I have looked to see if there is a way that I could get the org.apache.spark.deploy.master.Master to start and use the
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager
So I guess what I'm trying to ask is: does anyone have experience trying to run a Standalone Master that would use the Kubernetes backend, such as "KubernetesClusterManager", in order to have the worker nodes dynamically created as pods running executors, while having a permanent Standalone Master that allows a SparkContext to connect to it remotely in client mode?
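For reference, the plain standalone pieces mentioned above look roughly like this (a sketch only; it does not give the Kubernetes-backed dynamic workers being asked about, and the host name is a placeholder):

# long-running standalone master
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master --host <master-host>
# one worker per node, registered against that master
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<master-host>:7077
# remote SparkContext connecting to the master in client mode
pyspark --master spark://<master-host>:7077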

Installing Presto on a VM cluster and connecting it to HDFS on a different YARN cluster

We have an HDP 2.6.4 Spark cluster with 10 Linux worker machines.
The cluster runs Spark applications over HDFS. HDFS is installed on all the workers.
We wish to install Presto to query the cluster's HDFS; however, due to a lack of CPU resources on the worker machines (only 32 cores per machine), the plan is to install Presto outside of the cluster.
For that purpose we have several ESX hosts; each ESX host will have 2 VMs, and each VM will run a single Presto server.
All the ESX machines will be connected to the Spark cluster via 10G network cards so that the two clusters are on the same network.
My question is: can we install Presto on the VM cluster even though the HDFS is not on the ESX cluster (but instead on the Spark cluster)?
EDIT:
From the answer we got it seems that installing Presto on VMs is standard, so I'd like to clarify my question:
Presto has a configuration file named hive.properties under presto/etc.
Inside that file there's a parameter named hive.config.resources with the following value:
/etc/hadoop/conf/presto-hdfs-site.xml,/etc/hadoop/conf/presto-core-site.xml
These files are HDFS config files, but since the VM cluster and the Spark cluster (which contains the HDFS) are separate ones (the Presto on the VM cluster should access the HDFS that resides on the Spark cluster), the question is:
should these files be copied from the Spark cluster to the VM cluster?
Regarding your question - can we install Presto on the VM cluster even though the HDFS is not on the ESX cluster (but instead on the Spark cluster)?
The answer is YES.
On this cluster that isn't co-hosted with HDFS, don't forget to set the following parameter in hive.properties:
hive.force-local-scheduling=false
As long as the Presto VMs are configured as edge nodes (aka gateway nodes) and have all the necessary config files and tools, you shouldn't have any problem. For details on edge nodes see:
Do we need to Install Hadoop on Edge Node
How to create an Edge Node when creating a cloudera cluster
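To make the "have all the necessary config files" part concrete, a hedged sketch (the source host name is a placeholder): copy the HDFS client XMLs from the Spark/HDFS cluster onto each Presto VM and point hive.properties at them:

scp <hdfs-cluster-node>:/etc/hadoop/conf/presto-hdfs-site.xml /etc/hadoop/conf/
scp <hdfs-cluster-node>:/etc/hadoop/conf/presto-core-site.xml /etc/hadoop/conf/
# then, in presto/etc/hive.properties on each VM:
#   hive.config.resources=/etc/hadoop/conf/presto-hdfs-site.xml,/etc/hadoop/conf/presto-core-site.xml
#   hive.force-local-scheduling=false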

Is it worth deploying Spark on YARN if I have no other cluster software?

I have a Spark cluster running in standalone mode. I am currently executing code using a Jupyter notebook calling pyspark. Is there a benefit to using YARN as the cluster manager, assuming that the machines are not doing anything else?
Would I get better performance using YARN? If so, why?
Many thanks,
John
I'd say YES, considering these points.
Why Run on YARN?
Using YARN as Spark’s cluster manager confers a few benefits over Spark standalone:
1. You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
2. Spark standalone mode also requires a worker process on each node, and those workers cannot run non-Spark applications; with YARN, work is isolated in containers, so adopting another compute framework becomes a code change rather than an infrastructure plus code change, and the cluster can be shared among different frameworks.
3. YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes.
4. YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on a Spark application, without any changes in configuration.
I would say points 1, 2, and 3 apply to the scenario described, but not point 4, since we assumed no other frameworks are going to use the cluster.
source
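As a rough sketch of what switching the Jupyter/pyspark session from standalone to YARN might look like (the conf path and executor count are placeholder assumptions, not values from the question):

export HADOOP_CONF_DIR=/etc/hadoop/conf          # YARN and HDFS client configs
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --master yarn --deploy-mode client --num-executors 4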
