What is the Databricks Spark cluster manager? Can it be changed?

The original Spark distribution supports several cluster managers: YARN, Mesos, Spark Standalone, and Kubernetes.
I can't find what is under the hood in Databricks Spark: which cluster manager is it using, and is it possible to change it?
What is the Databricks Spark architecture?
Thanks.

You can't check the cluster manager in Databricks, and you don't really need to, because this part is managed for you. You can think of it as a kind of standalone cluster, but there are differences. The general Databricks architecture is shown here.
You can change the cluster configuration by different means: init scripts, Spark configuration parameters, etc. See the documentation for more details.
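Even though the manager itself is fixed, you can still inspect the effective settings from a notebook. A minimal sketch in PySpark, assuming the `spark` session Databricks pre-creates; the exact values printed are internal details and may change:

```python
# Inspect (but not change) the cluster manager details from a Databricks notebook.
# `spark` is the SparkSession Databricks provides automatically.
print(spark.sparkContext.master)  # internal master URL; not user-configurable

for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if key.startswith("spark.databricks."):  # Databricks-specific settings
        print(key, value)
```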

Related

Suggestions for using Spark in AWS without any cluster?

I want to leverage a couple of Spark APIs to convert some data in an EC2 container. We deploy these containers on Kubernetes.
I'm not super familiar with Spark, but I see there are requirements on the Spark context, etc. Is it possible for me to just leverage the Spark APIs/RDDs, etc. without needing any cluster? I just have a simple script I want to run that leverages Spark. I was thinking I could somehow fat-jar this dependency or something, but I'm not quite sure what I'm looking for.
Yes, you need a cluster to run Spark; a cluster is nothing more than a platform on which Spark is installed.
I think your question should be: "Can Spark run on a single/standalone node?"
If that is what you want to know, then yes: Spark can run on a single node, since Spark ships with its own standalone cluster manager.
"but I see there are requirements on Spark context":
SparkContext is the entry point of Spark; you need to create it in order to use any Spark functionality.
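A minimal sketch of that single-node setup in PySpark, running entirely in-process ("local mode") with no external cluster; the file paths are hypothetical:

```python
# Spark in local mode: driver and executors run inside one JVM on this machine.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")        # single node, all available cores
         .appName("local-conversion")
         .getOrCreate())

df = spark.read.json("input.json")  # hypothetical input path
df.write.parquet("output.parquet")  # hypothetical output path
spark.stop()
```

The SparkContext is still created here (via the SparkSession), but everything runs inside your one container, which fits the fat-jar/simple-script use case.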

Run parallel jobs on on-prem dynamic Spark clusters

I am new to Spark, and we have a requirement to set up a dynamic Spark cluster to run multiple jobs. Referring to some articles, we can achieve this by using the EMR (Amazon) service.
Is there any way the same setup can be done locally?
Once Spark clusters are available, with services running on different ports on different servers, how do I point Mist to a new Spark cluster for each job?
Thanks in advance.
Yes, you can use the standalone cluster manager that Spark provides, where you set up the Spark cluster yourself (master and worker nodes). There are also Docker containers that can be used to achieve that; take a look here.
Another option would be to deploy a Hadoop ecosystem locally, such as MapR, Hortonworks, or Cloudera.
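Once a standalone cluster is up, pointing each job (for example from Mist) at it is just a matter of the master URL. A minimal PySpark sketch; the host is a placeholder for whatever your `start-master.sh` reports, and 7077 is the default standalone port:

```python
# Submit work to a locally deployed standalone cluster via its master URL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master-host:7077")  # placeholder host; default port 7077
         .appName("job-on-standalone")
         .getOrCreate())
```

Each job can be given a different master URL this way, which is how you would target per-job dynamic clusters.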

While creating an Azure HDInsight cluster for Starburst Presto, can I create a Spark cluster?

While creating infrastructure for big data, I wanted to use Azure HDInsight with a Presto installation. Azure HDInsight comes in different flavors, like Hadoop, Spark, etc. The documentation recommends using a Hadoop cluster, but I want to use the Spark one.
Is it possible to use a Spark cluster with Starburst's Presto distribution?
It looks like you want to use both Presto and Spark at the same time.
If you run them on a single cluster, you would need to configure them appropriately to make sure the JVMs for the different processes can coexist. This is possible, but hard to do in practice (you need to know how the JVM allocates memory beyond the -Xmx setting), so it's definitely not recommended.
While I can imagine that in some on-premises installations, where provisioning new hardware is hard, you might want to colocate services on one cluster, in the cloud it's much more convenient to provision two separate clusters, each appropriately sized for your particular needs and workload. For example, you could have one cluster with Presto for interactive analytics, dashboarding, and ad-hoc queries, and another with Spark for your machine learning or ETL workloads.
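To make the memory point concrete on the Spark side: an executor's real footprint is roughly its JVM heap plus an off-heap overhead, so colocated services must be budgeted beyond -Xmx alone. A minimal PySpark sketch; the sizes are hypothetical:

```python
# An executor consumes about spark.executor.memory (the JVM heap, i.e. -Xmx)
# plus spark.executor.memoryOverhead (off-heap; by default roughly
# max(384 MiB, 10% of the heap)). Budget colocated services accordingly.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-budget")
         .config("spark.executor.memory", "8g")          # heap per executor (hypothetical)
         .config("spark.executor.memoryOverhead", "2g")  # off-heap allowance on top
         .getOrCreate())
```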
Please refer to the Starburst Presto on Azure documentation for detailed configuration instructions.

Is it worth deploying Spark on YARN if I have no other cluster software?

I have a Spark cluster running in standalone mode. I am currently executing code using a Jupyter notebook calling PySpark. Is there a benefit to using YARN as the cluster manager, assuming that the machines are not doing anything else?
Would I get better performance using YARN? If so, why?
Many thanks,
John
I'd say yes, considering the following points.
Why Run on YARN?
Using YARN as Spark’s cluster manager confers a few benefits over Spark standalone:
1. You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
2. Spark standalone mode requires workers dedicated to Spark that cannot run non-Spark applications, whereas with YARN each workload is isolated in containers, so adopting another compute framework becomes a code change instead of an infrastructure + code change. The cluster can therefore be shared among different frameworks.
3. YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes.
4. YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on a Spark application, without any changes in configuration.
I would say points 1, 2, and 3 apply to the scenario you describe, but not point 4, since we assumed no other frameworks are going to use the cluster.
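As a small illustration of how little the application itself changes, a minimal PySpark sketch targeting YARN; it assumes HADOOP_CONF_DIR (or YARN_CONF_DIR) points at your Hadoop configuration, and the queue name is a hypothetical example of point 1:

```python
# The same app that ran on the standalone master runs on YARN by changing
# only configuration, not code.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")                           # vs. "spark://host:7077" on standalone
         .config("spark.yarn.queue", "analytics")  # hypothetical YARN scheduler queue
         .appName("job-on-yarn")
         .getOrCreate())
```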
Source

YARN vs Spark processing engine for a real-time application?

I understand YARN and Spark, but I want to know when I need to use YARN versus the Spark processing engine. What case studies would help me identify the difference between YARN and Spark?
You cannot compare YARN and Spark directly per se. YARN is a distributed container manager, like Mesos for example, whereas Spark is a data processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. It just happens that Hadoop MapReduce is a feature that ships with YARN, while Spark is not.
If you mean comparing MapReduce and Spark, I suggest reading this other answer.
Apache Spark can be run on YARN, on Mesos, or in standalone mode.
Spark in standalone mode: all resource management and job scheduling are handled by Spark's built-in cluster manager.
Spark on YARN: YARN is the resource manager introduced in MRv2, which supports not only native Hadoop but also Spark, Kafka, Elasticsearch, and other custom applications.
Spark on Mesos: Spark also supports Mesos, which is one more type of resource manager. The sketch below shows how the master URL alone selects the mode.
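A minimal PySpark sketch; the hosts and ports are placeholders:

```python
# The deployment mode is chosen entirely by the master URL; application code
# is unchanged across the three options.
from pyspark.sql import SparkSession

masters = {
    "standalone": "spark://master-host:7077",  # Spark's built-in cluster manager
    "yarn": "yarn",                            # resolved via HADOOP_CONF_DIR
    "mesos": "mesos://mesos-host:5050",        # Mesos resource manager
}

spark = (SparkSession.builder
         .master(masters["standalone"])
         .appName("mode-demo")
         .getOrCreate())
```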
Advantages of Spark on YARN
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN schedulers can be used for Spark jobs, and only with YARN can Spark run against Kerberized Hadoop clusters, using secure authentication between its processes.
See the linked documentation for more on YARN and Spark.
To conclude: if you want to build a small, simple cluster independent of everything else, go for standalone mode. If you want to use an existing Hadoop cluster, go for YARN or Mesos.
