Can I have an Apache Spark cluster where different nodes run different versions of Spark? For example, could I have a master that is Spark 2.2.0, one node that is 2.0.1, another that is 2.2.0, and another that is 1.6.3, or should all nodes have the same version of Spark?
Usually, when we want multiple versions of Spark on a cluster, all of the versions are installed on all of the nodes; which version a job runs with depends on which spark-submit (Spark 1.6, 2.0, or 2.2) is used to launch the script.
Say we installed Spark 1.6 on the master node only. When we submit a job and the master node is fully utilized, the YARN ResourceManager looks for a node with free resources to run it. YARN does not wait until the master node frees up; it submits the job to whichever node has capacity. For this reason, every version of Spark has to be installed on every node of the cluster.
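As a sketch, with every version installed under a common prefix on each node (the paths below are hypothetical), the version a job runs with is simply a matter of which spark-submit you invoke:

```shell
# Hypothetical install locations; every node carries all three builds.
SPARK_163_HOME=/opt/spark-1.6.3
SPARK_201_HOME=/opt/spark-2.0.1
SPARK_220_HOME=/opt/spark-2.2.0

# The Spark version a job runs with is decided here, at submit time,
# not by whichever node YARN happens to schedule it on:
SUBMIT="$SPARK_220_HOME/bin/spark-submit"
echo "$SUBMIT"
```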
Can I have an apache Spark cluster where different nodes run different versions of Spark?
No. This is not possible.
The reason is that there is no notion of a Spark installation. Spark is a library, and as such it is a dependency of an application that, once submitted for execution, is deployed and executed on the cluster nodes (at least one of them, i.e. the driver).
That said, the version of the Spark dependency of your application is exactly the version of Spark in use. To be precise, it is the version of spark-submit in use (unless you use a so-called uber-jar with the Spark dependency bundled).
Related
I am new to Spark and learning the architecture. I understand that Spark supports three cluster managers: YARN, Standalone and Mesos.
In YARN cluster mode, the Spark driver resides in the ResourceManager and the executors in the YARN containers of the NodeManagers.
In standalone cluster mode, the Spark driver resides in the master process and the executors in the slave processes.
If my understanding is correct, is it required to install Spark on all the NodeManagers of a YARN cluster and on the slave nodes of a standalone cluster?
If you use YARN as the manager on a cluster with multiple nodes, you do not need to install Spark on each node. YARN will distribute the Spark binaries to the nodes when a job is submitted.
https://spark.apache.org/docs/latest/running-on-yarn.html
Running Spark on YARN requires a binary distribution of Spark which is built with YARN support. Binary distributions can be downloaded from the downloads page of the project website. To build Spark yourself, refer to Building Spark.
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
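A minimal sketch of the spark.yarn.jars option (the HDFS path and application class below are assumptions for illustration): stage the Spark jars in HDFS once, then point spark.yarn.jars at them so YARN containers fetch exactly that build instead of re-uploading $SPARK_HOME/jars on every submission:

```shell
# One-off staging step (shown as comments; requires a running HDFS):
#   hdfs dfs -mkdir -p /spark/2.2.0/jars
#   hdfs dfs -put "$SPARK_HOME"/jars/* /spark/2.2.0/jars/

# At submit time, reference the staged location so YARN containers
# pull this exact Spark build from the distributed cache:
SPARK_YARN_JARS="hdfs:///spark/2.2.0/jars/*"
echo spark-submit --master yarn \
  --conf "spark.yarn.jars=$SPARK_YARN_JARS" \
  --class com.example.MyApp my-app.jar
```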
How is spark2-submit different from spark-submit? I need to migrate my code from Spark 1.6 to Spark 2.4. Can I still use spark-submit to launch my application, or is it compulsory to move to spark2-submit?
I think you are using Cloudera Hadoop. Spark 2.x had major changes compared to the 1.x versions, so there are compatibility issues. When an existing production job built against 1.x runs on 2.x, there is a good chance it will fail.
To preserve backward compatibility, Cloudera added "spark2-submit" and asked users to use it for all "go-forward" jobs, while "spark-submit" continues to use the 1.x version, so you need not touch any of your existing production jobs.
So it exists purely for compatibility reasons.
You can use spark-submit for Spark 2.x after setting the following environment variables:
1) SPARK_HOME to the path of the spark2-client (e.g. /usr/hdp/current/spark2-client)
2) SPARK_MAJOR_VERSION=2
With these two settings, even if you have both Spark 1.x and Spark 2.x installed on the cluster, you can run jobs with Spark 2.x using the same commands, such as spark-shell and spark-submit.
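For example, on an HDP-style cluster (the path below is the typical HDP default; adjust it to your layout):

```shell
# Point the plain spark-submit / spark-shell wrappers at the 2.x build:
export SPARK_HOME=/usr/hdp/current/spark2-client
export SPARK_MAJOR_VERSION=2

# The unversioned commands now resolve to the Spark 2.x binaries:
echo "$SPARK_HOME/bin/spark-submit"
```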
I have just moved from a Spark local setup to a Spark standalone cluster. Obviously, loading and saving files no longer works.
I understand that I need to use Hadoop for saving and loading files.
My Spark installation is spark-2.2.1-bin-hadoop2.7
Question 1:
Am I correct that I still need to separately download, install and configure Hadoop to work with my standalone Spark cluster?
Question 2:
What would be the difference between running with Hadoop and running with Yarn? ...and which is easier to install and configure (assuming fairly light data loads)?
A1. Right. The package you mentioned is just bundled with the Hadoop client libraries of the specified version; you still need to install Hadoop if you want to use HDFS.
A2. Running with YARN means using YARN as Spark's resource manager (http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications). So, in cases where you don't need a DFS, for example when you only run Spark Streaming applications, you can still install Hadoop but start only the YARN processes, using just its resource-management functionality.
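As a sketch, submitting against YARN only requires that HADOOP_CONF_DIR point at your YARN configuration; the master URL is just the literal word "yarn" (the config path, class, and jar name below are illustrative assumptions):

```shell
# Spark reads the ResourceManager address from the Hadoop config
# directory, not from the master URL itself:
export HADOOP_CONF_DIR=/etc/hadoop/conf
echo spark-submit --master yarn --deploy-mode cluster \
  --class com.example.StreamingApp my-streaming-app.jar
```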
I am aware of the fact that with a change of the major version of Spark (i.e. from 1.* to 2.*) there will be compile-time failures due to changes in existing APIs.
As far as I know, Spark guarantees that with a minor version update (i.e. 2.0.* to 2.2.*), the changes will be backward compatible.
Although this eliminates the possibility of compile-time failures with the upgrade, would it be safe to assume that there won't be any runtime failures either if I submit a job to a Spark 2.2.* standalone cluster using an artifact (jar) built with 2.0.* dependencies?
would it be safe to assume that there won't be any runtime failures either if I submit a job to a 2.2.* cluster using an artifact (jar) built with 2.0.* dependencies?
Yes.
I'd even say that there's no concept of a Spark cluster unless we talk about the built-in Spark Standalone cluster.
In other words, you deploy a Spark application to a cluster, e.g. Hadoop YARN or Apache Mesos, as an application jar that may or may not contain the Spark jars, and so it can disregard what's already available in the environment.
If, however, you do mean Spark Standalone, things may break between releases, even between 2.0 and 2.2, as the jars in your Spark application have to be compatible with the ones on the JVMs of the Spark workers (they are already pre-loaded).
I would not claim full compatibility between releases of Spark Standalone.
I understand YARN and Spark, but I want to know when I need to use YARN versus the Spark processing engine. What case studies would help me identify the difference between YARN and Spark?
You cannot compare YARN and Spark directly per se. YARN is a distributed container manager, like Mesos for example, whereas Spark is a data-processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. It just happens that Hadoop MapReduce is a feature that ships with YARN, whereas Spark is not.
If you mean comparing Map Reduce and Spark, I suggest reading this other answer.
Apache Spark can run on YARN, Mesos, or in Standalone mode.
Spark in Standalone mode: all resource management and job scheduling are handled by Spark itself.
Spark on YARN: YARN is a resource manager introduced in MRv2 which supports not only native Hadoop applications but also Spark, Kafka, Elasticsearch, and other custom applications.
Spark on Mesos: Spark also supports Mesos, yet another type of resource manager.
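From the application's point of view, the three modes differ mainly in the --master URL passed to spark-submit (host names, ports, and the jar name below are illustrative):

```shell
# Standalone: point at the Spark master process itself.
STANDALONE_MASTER="spark://master-host:7077"
# YARN: the ResourceManager address comes from HADOOP_CONF_DIR,
# so the URL is just the literal word "yarn".
YARN_MASTER="yarn"
# Mesos: point at the Mesos master.
MESOS_MASTER="mesos://mesos-host:5050"

echo spark-submit --master "$STANDALONE_MASTER" my-app.jar
echo spark-submit --master "$YARN_MASTER" my-app.jar
echo spark-submit --master "$MESOS_MASTER" my-app.jar
```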
Advantages of Spark on YARN
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN schedulers can be used for Spark jobs, and only with YARN can Spark run against Kerberized Hadoop clusters, using secure authentication between its processes.
Link for more documentation on YARN, Spark.
To conclude: if you want to build a small, simple cluster independent of everything else, go for Standalone. If you want to use an existing Hadoop cluster, go for YARN or Mesos.