I've set up an Azure HDInsight Linux Spark cluster, which should come with Spark 1.5.2. However, when I output the version number using sparkContext.version, I keep getting 1.4.1.
Is there a configuration step I'm missing? I've been provisioning clusters both through the GUI and through scripts.
This is a known bug on our side. The binary version is actually 1.5.2 but it is being reported incorrectly.
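If you want to confirm what is actually installed, the jar file names on disk are more trustworthy than the reported version string. A rough check, assuming the usual HDP-style layout that HDInsight Linux clusters use (the exact path is an assumption; adjust for your image):

# The Spark assembly jar's file name encodes the real binary version
ls /usr/hdp/current/spark-client/lib/ | grep spark-assembly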
We use Spark 2.4.0 to connect to a Hadoop 2.7 cluster and query a Hive metastore (version 2.3). But the cluster management team has decided to upgrade to Hadoop 3.x and Hive 3.x. We could not migrate to Spark 3 yet, which is compatible with Hadoop 3 and Hive 3, because we could not test whether anything breaks.
Is there any way to stick with Spark 2.4.x and still be able to use Hadoop 3 and Hive 3?
I've heard that backporting is one option; it would be great if you could point me in that direction.
You can compile Spark 2.4 with the hadoop-3.1 profile instead of relying on the default Hadoop version. You need to use the hadoop-3.1 profile as described in the documentation on building Spark, something like:
./build/mvn -Pyarn -Phadoop-3.1 -DskipTests clean package
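If you also need Spark SQL to talk to Hive, it may help to enable the Hive profiles in the same build. A minimal sketch, assuming the standard -Phive and -Phive-thriftserver profiles are present on your 2.4.x branch (check the building-spark documentation for your exact tag):

# Build Spark 2.4.x against the hadoop-3.1 profile with Hive support enabled
./build/mvn -Pyarn -Phadoop-3.1 -Phive -Phive-thriftserver -DskipTests clean package
# Sanity-check what the resulting build reports
./bin/spark-submit --version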
I am planning to upgrade from Hortonworks Data Platform (HDP) 2.6.x to HDP 3.0. However, there seem to be some major bugs in Apache Spark 2.3.x and its integration with Hadoop 3.0, which are still unresolved in the Apache Spark JIRA, although the Spark development team is working to resolve them. Have these issues been worked around or resolved by the Hortonworks team, or do they still exist in HDP 3.0?
Some unresolved issues concerning my use case:
Spark DataFrames does not work with Hadoop 3.0 https://issues.apache.org/jira/browse/SPARK-18673
Kerberos Ticket renewal fails in Hadoop 3 https://issues.apache.org/jira/browse/SPARK-24493
Spark run on Hadoop 3 https://issues.apache.org/jira/browse/SPARK-23534
I checked the integration of HDP Spark 2.3.1 with Hadoop 3.0.1. It works perfectly: the above issues were resolved in the HDP version of Spark, although this was not mentioned in the HDP 3 release notes.
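If you want to reproduce this check on your own HDP nodes, a quick sanity check is to compare the versions the cluster actually reports (these are standard commands, nothing HDP-specific):

# Print the Spark and Hadoop versions actually in use
spark-submit --version
hadoop version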
Check the community answer
I searched but couldn't find a concrete difference between the Apache distribution of Spark 2 and the Cloudera distribution of Spark 2. Can anybody help me understand the differences in Spark Core, Spark SQL, and Spark Streaming?
They refer to the same thing. Cloudera distributes a packaged version of Hadoop that includes Apache Spark 2. There are slight differences between this Spark 2 and the latest upstream version of Spark 2 from https://spark.apache.org/. These are usually spelled out in the release notes for CDH Spark 2.
For example, the release notes have a section called Spark 2 Known Issues, which describes some missing features.
In general, incompatibilities arise because there is a lag between upstream releases and CDH releases, and because CDH has to maintain major version compatibility between minor releases.
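One practical way to see exactly what Cloudera shipped is to check the build string, which encodes the Cloudera patch level. A small sketch, assuming a CDH node with the Spark 2 parcel (CDS) installed; note that the parcel installs its commands with a 2 suffix, and the exact output format varies by release:

# Print the Cloudera build of Spark 2; the version typically looks like
# 2.x.y.clouderaN, where the suffix identifies Cloudera's patched build
spark2-submit --version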
I already have Hadoop 3.0.0 installed. Should I now install the with-hadoop or without-hadoop version of Apache Spark from this page?
I am following this guide to get started with Apache Spark.
It says
Download the latest version of Apache Spark (pre-built according to your Hadoop version) from this link:...
But I am confused. If I already have an instance of Hadoop running on my machine, and I then download, install, and run Apache Spark WITH Hadoop, won't that start another, additional instance of Hadoop?
First off, Spark does not yet support Hadoop 3, as far as I know. You'll notice this by the absence of a Hadoop 3 option for "your Hadoop version" among the downloads.
You can try setting HADOOP_CONF_DIR and HADOOP_HOME in your spark-env.sh, though, regardless of which you download.
You should always download the version without Hadoop if you already have it.
won't it start another additional instance of Hadoop?
No. You would still need to explicitly configure and start that bundled version of Hadoop.
I believe that Spark download is already configured to use the included Hadoop libraries.
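For the "without Hadoop" build, the documented way to point Spark at an existing Hadoop installation is through conf/spark-env.sh. A minimal sketch; the installation paths below are assumptions for illustration:

# conf/spark-env.sh for a Spark build "without Hadoop"
export HADOOP_HOME=/opt/hadoop                   # adjust to your installation
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Have Spark pick up the Hadoop jars from the existing installation
export SPARK_DIST_CLASSPATH=$(hadoop classpath)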
This is in addition to the answer by cricket_007.
Normally, if you already have Hadoop installed, you would not download Spark with Hadoop bundled. However, since your Hadoop version is not yet supported by any version of Spark, you will need to download the one with Hadoop, and you will have to configure the bundled Hadoop version on your machine for Spark to run on. This means all your data on the Hadoop 3 installation will be LOST. So, if you need this data, please take a backup before beginning your downgrade/re-configuration. I do not think you will be able to host two instances of Hadoop on the same system because of conflicting environment variables.
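If you do go down that road, one simple way to preserve HDFS data first is to copy it out to local disk. A minimal sketch; the source and destination paths are placeholders:

# Back up an HDFS directory to local disk before reconfiguring the cluster
hdfs dfs -copyToLocal /user/myuser/data /backup/hdfs-data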
I am new to Apache Spark. I heard that no version of CDH currently supports Apache Spark SQL, and that the same is true of the Hortonworks distribution. Is that true?
Also, I have CDH 5.0.0 installed on my PC; which version of Apache Spark does my CDH support?
Could someone also provide the steps to execute a Spark program in my CDH distribution? I have written some basic programs using Apache Spark 1.2, and I am not able to run them in the CDH environment. I am facing a very basic problem when running my Spark program with the spark-submit command:
spark-submit: Command not found
Do I need to configure anything before running my Spark program?
Thanks in advance
All of the distributions of CDH include the whole Spark distribution, including Spark SQL.
EDIT: Spark SQL is officially supported as of CDH 5.5.x.
CDH 5.0.x includes Spark 0.9.x. CDH 5.3.x includes Spark 1.2.x, and 5.4.x should ship 1.3.x, since it is about to be released upstream.
spark-submit is already on your PATH if you are using CDH. If you're running it from somewhere else, you have to put it on your PATH or give the full path to it, the same as with any program. So something is wrong with your setup.
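As a rough sketch of how to check this on a parcel-based CDH install (the parcel path is the usual default but an assumption here, and the application class and jar names are placeholders):

# Check whether spark-submit is on the PATH, and where the parcel put it
which spark-submit
ls /opt/cloudera/parcels/CDH/bin/spark-submit
# Invoke it by full path if needed (class and jar names are hypothetical)
/opt/cloudera/parcels/CDH/bin/spark-submit --class com.example.MyApp myapp.jar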