Does any Cloudera Hadoop distribution support Apache Spark SQL? - apache-spark

I am new to Apache Spark. I heard that none of the versions of CDH support Apache Spark SQL as of now, and that the same is true of the Hortonworks distribution. Is that true?
Another question: I have CDH 5.0.0 installed on my PC; which version of Apache Spark does my CDH support?
Also, could someone please give me the steps to run my Spark program on my CDH distribution? I have written some basic programs using Apache Spark 1.2, but I am not able to run them in the CDH environment; I hit a very basic problem when launching a Spark program with the spark-submit command:
spark-submit: Command not found
Do I need to configure anything before running my Spark program?
Thanks in advance

All of the distributions of CDH include the whole Spark distribution, including Spark SQL.
EDIT: It is supported as of CDH 5.5.x.
CDH 5.0.x includes Spark 0.9.x. CDH 5.3.x includes Spark 1.2.x, and 5.4.x should ship 1.3.x since it is about to be released upstream.
spark-submit is already on your PATH if you are using CDH. If you're running it from somewhere else, you have to add it to your PATH or give the full path to it, just as with any other program. So something is wrong with your setup.
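For example, here is a rough sketch of locating and invoking it on a parcel-based CDH install (the parcel path, class name and jar below are assumptions; adjust them for your setup):
# check whether spark-submit is on the PATH, and where CDH typically puts it
which spark-submit || ls /opt/cloudera/parcels/CDH/bin/spark-submit
# either call it with the full path...
/opt/cloudera/parcels/CDH/bin/spark-submit --master yarn --class com.example.MyApp my-app.jar
# ...or add the directory to your PATH first
export PATH=$PATH:/opt/cloudera/parcels/CDH/bin
spark-submit --master yarn --class com.example.MyApp my-app.jar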

Related

Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

We use Spark 2.4.0 to connect to a Hadoop 2.7 cluster and query a Hive metastore (version 2.3). But the cluster managing team has decided to upgrade to Hadoop 3.x and Hive 3.x. We could not migrate to Spark 3, which is compatible with Hadoop 3 and Hive 3, because we have not been able to test whether anything breaks.
Is there any way to stick with the Spark 2.4.x version and still be able to use Hadoop 3 and Hive 3?
I heard that backporting is one option; it would be great if you could point me in that direction.
You can compile Spark 2.4 with the Hadoop 3.1 profile instead of relying on the default version. You need to use the hadoop-3.1 profile, as described in the documentation on building Spark, something like:
./build/mvn -Pyarn -Phadoop-3.1 -DskipTests clean package
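If you also need Hive support compiled in, the same build documentation describes the -Phive and -Phive-thriftserver profiles; a possible (untested) variant of the above would be:
./build/mvn -Pyarn -Phadoop-3.1 -Phive -Phive-thriftserver -DskipTests clean package
Whether the resulting build works against a Hive 3 metastore is something you would still need to verify yourself.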

spark2-submit different from spark-submit

How is spark2-submit different from spark-submit? I need to migrate my code from Spark 1.6 to Spark 2.4. Can I still use spark-submit to launch my application, or is it compulsory to move to spark2-submit?
I think you are using Cloudera Hadoop. Spark 2.x had major changes compared to the 1.x versions, so there are compatibility issues; when an existing production job written against 1.x runs on 2.x, there is a good chance it will fail.
To provide backward compatibility, Cloudera added "spark2-submit" and asked users to use it for all "go-forward" jobs, while "spark-submit" still uses the 1.x version, so you need not touch any existing production jobs.
So it exists purely for compatibility reasons.
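As a rough illustration of that split (the class and jar names here are hypothetical), the two launchers on such a cluster pick up different Spark installations:
# uses the original Spark 1.x service shipped with CDH
spark-submit --master yarn --class com.example.MyJob my-job.jar
# uses the separately installed Spark 2.x service
spark2-submit --master yarn --class com.example.MyJob my-job.jar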
You can use spark-submit for Spark 2.x after setting the following environment variables:
1) SPARK_HOME to the path of the spark2-client (e.g. /usr/hdp/current/spark2-client)
2) SPARK_MAJOR_VERSION=2
With these two settings, even if you have both Spark 1.x and Spark 2.x installed on the cluster, you can run jobs with Spark 2.x using the same commands (spark-shell, spark-submit).
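A minimal sketch of that setup in a shell session (the spark2-client path is the one from the answer above and may differ on your cluster; the class and jar names are hypothetical):
# point SPARK_HOME at the Spark 2.x client and select the major version
export SPARK_HOME=/usr/hdp/current/spark2-client
export SPARK_MAJOR_VERSION=2
# plain spark-submit now launches the Spark 2.x runtime
spark-submit --master yarn --class com.example.MyJob my-job.jar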

If I already have Hadoop installed, should I download Apache Spark WITH Hadoop or WITHOUT Hadoop?

I already have Hadoop 3.0.0 installed. Should I now install the with-hadoop or without-hadoop version of Apache Spark from this page?
I am following this guide to get started with Apache Spark.
It says
Download the latest version of Apache Spark (Pre-built according to your Hadoop version) from this link: ...
But I am confused. If I already have an instance of Hadoop running on my machine and I then download, install and run Apache-Spark-WITH-Hadoop, won't it start another, additional instance of Hadoop?
First off, Spark does not yet support Hadoop 3, as far as I know. You'll notice this because there is no download option listed for your Hadoop version.
You can try setting HADOOP_CONF_DIR and HADOOP_HOME in your spark-env.sh, though, regardless of which you download.
You should always download the version without Hadoop if you already have it.
won't it start another additional instance of Hadoop?
No. You still would need to explicitly configure and start that version of Hadoop.
That Spark option is already configured to use the included Hadoop, I believe.
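As a sketch of what wiring a "without Hadoop" Spark build to an existing Hadoop install can look like (the paths are assumptions; the SPARK_DIST_CLASSPATH approach is the one described in Spark's "Hadoop Free" build documentation):
# conf/spark-env.sh -- point Spark at an existing Hadoop installation
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# make Hadoop's jars visible to the "without Hadoop" Spark build
export SPARK_DIST_CLASSPATH=$(hadoop classpath)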
This is in addition to the answer by @cricket_007.
Normally, if you already have Hadoop installed, you would not download Spark with Hadoop bundled. However, since your Hadoop version is not yet supported by any Spark release, you will need to download the build that bundles Hadoop. You will then have to configure that bundled Hadoop version on your machine for Spark to run on, which means all of the data on your Hadoop 3 installation will be LOST. So, if you need that data, please take a backup before beginning your downgrade/re-configuration. I do not think you will be able to host two instances of Hadoop on the same system because of conflicting environment variables.

Spark version 1.4.1 instead of 1.5.2 on HDInsight

I've set up an Azure HDInsight Linux Spark cluster which should come with Spark 1.5.2. However, when I output the version number using sparkContext.version, I keep getting 1.4.1.
Is there maybe a step I'm missing in the configuration? I've been booting up clusters both through the GUI and through scripts.
This is a known bug on our side. The binary version is actually 1.5.2 but it is being reported incorrectly.

Apache Zeppelin & Spark Streaming: Twitter Example only works local

I just added the example project from http://zeppelin-project.org/docs/tutorial/tutorial.html (section "Tutorial with Streaming Data") to my Zeppelin notebook. The problem is that the application only seems to work locally. If I change the Spark interpreter setting "master" from "local[*]" to "spark://master:7077", the same SQL statement no longer returns any results. Am I doing anything wrong? I have already restarted the Zeppelin interpreter, the whole Zeppelin daemon, and the Spark cluster, but nothing solved the issue. Can someone help?
I use the following installation:
Spark 1.5.1 (prebuilt for Hadoop 2.6+), Master + 2x Slaves
Zeppelin 0.5.5 (installed on Spark's master node)
EDIT
Also the following installation won't work for me:
Spark 1.5.0 (prebuilt for Hadoop 2.6+), Master + 2x Slaves
Zeppelin 0.5.5 (installed on Spark's master node)
Screenshot: local setting (works!)
Screenshot: cluster setting (won't work!)
The job seems to run correctly in cluster mode:
I figured it out after 2 days of trying!
The difference between the local Zeppelin Spark interpreter and the Spark cluster seems to be that the local one includes the Twitter utilities needed to run the Twitter streaming example, while the Spark cluster does not have this library by default.
Therefore you have to add the dependency manually in the Zeppelin notebook before starting the application with the Spark cluster as master. So the first paragraph of the notebook must be:
%dep
z.reset
z.load("org.apache.spark:spark-streaming-twitter_2.10:1.5.1")
If an error occurs when running this paragraph, just try restarting the Zeppelin server via ./bin/zeppelin-daemon.sh stop (& start)!
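If %dep keeps failing, an alternative worth trying is to pass the artifact through spark-submit's --packages option in Zeppelin's conf/zeppelin-env.sh (a sketch only; check against your Zeppelin version's docs whether it honours SPARK_SUBMIT_OPTIONS this way), then restart the daemon:
# conf/zeppelin-env.sh -- pull the Twitter streaming artifact when the Spark interpreter starts
export SPARK_SUBMIT_OPTIONS="--packages org.apache.spark:spark-streaming-twitter_2.10:1.5.1"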
