Installing Spark 2 on CDH 5.* with RPM? - apache-spark

I have a Cloudera CDH 5.11 cluster installed from RPM packages (we don't want to use Cloudera Manager or parcels). Has anyone found/built Spark 2 RPM packages for CDH? It seems Cloudera only ships Spark 2 as parcels.

You won't find any. For now, the "Spark 2 Known Issues" documentation clearly states:
Package Install is not Supported
The Cloudera Distribution of Apache Spark 2 is only installable as a parcel.
https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#ki_package_install

The best way is to use Spark on YARN instead of the Spark Master/Worker (standalone) daemons. You are free to use any Spark version you like, independent of what the vendor ships.
What you need to do is package the Spark History Server yourself so you can look at jobs after they finish. And if you want to use Dynamic Allocation, you need the Spark Shuffle Service configured in YARN; a sketch of the client-side settings is shown below.
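For illustration only, a minimal spark-submit sketch against YARN with dynamic allocation enabled. It assumes SPARK_HOME points at your own Spark 2 download, HADOOP_CONF_DIR at the cluster's client configuration, and that the spark_shuffle auxiliary service has already been registered in yarn-site.xml on the NodeManagers; the application class and jar names are hypothetical.
# point Spark at the cluster's YARN/HDFS client configs
export HADOOP_CONF_DIR=/etc/hadoop/conf
# submit to YARN with dynamic allocation (requires the external shuffle service on the NodeManagers)
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --class com.example.MyApp \
  my-app.jar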

Looks like I can't comment yet, so excuse this post as an answer.
Is it possible to install the Spark 2 parcel on an RPM-installed cluster using Cloudera Manager?

From CDH 6.0 onwards, Spark 2 is included as RPMs. Problem solved.

Related

Elasticsearch for spark 3.0

I'm getting issues while using Spark 3.0 to read from Elasticsearch.
My Elasticsearch version is 7.6.0, and I used the elastic jar of the same version.
Please suggest a solution.
Spark 3.0.0 relies on Scala 2.12, which is not yet supported by elasticsearch-hadoop. This and a few further issues prevent us from using Spark 3.0.0 together with Elasticsearch. If you want to compile it yourself, there is a pull request on elasticsearch-hadoop (https://github.com/elastic/elasticsearch-hadoop/pull/1308) which should at least allow using Scala 2.12. Not sure if it will fix the other issues as well.
It's now officially released for Spark 3.0.
Enhancements:
https://www.elastic.co/guide/en/elasticsearch/hadoop/7.12/eshadoop-7.12.0.html
Maven Repository:
https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-30_2.12/7.12.0
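As a quick sketch of how the released connector can be pulled in, assuming a Spark 3.0.x install with SPARK_HOME set and an Elasticsearch node reachable on localhost:9200 (hypothetical host, port, and index name):
# fetch the connector from Maven Central and start an interactive shell
$SPARK_HOME/bin/spark-shell \
  --packages org.elasticsearch:elasticsearch-spark-30_2.12:7.12.0 \
  --conf spark.es.nodes=localhost \
  --conf spark.es.port=9200
# inside the shell, an index can then be read with: spark.read.format("es").load("my-index")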
It is not official for now, but you can compile the connector yourself from
https://github.com/elastic/elasticsearch-hadoop, the steps are
git clone https://github.com/elastic/elasticsearch-hadoop.git
cd elasticsearch-hadoop/
vim ~/.bashrc
export JAVA8_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
source ~/.bashrc
./gradlew elasticsearch-spark-30:distribution --console=plain
and finally you can find the .jar package in the folder elasticsearch-hadoop/spark/sql-30/build/distributions; elasticsearch-spark-30_2.12-8.0.0-SNAPSHOT.jar is the ES connector package.
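A hedged usage sketch of the locally built snapshot, assuming the build above succeeded and SPARK_HOME points at a Spark 3.0.0 install:
# attach the freshly built connector jar when launching a shell or job
$SPARK_HOME/bin/pyspark \
  --jars elasticsearch-hadoop/spark/sql-30/build/distributions/elasticsearch-spark-30_2.12-8.0.0-SNAPSHOT.jar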

Cloudera CDH 5.13 - is it possible to run spark 2.x with Yarn mode?

Team,
I installed CDH 5.13 on my local computer and am upgrading from Spark 1.6 to Spark 2.0. Is it possible to run a Spark application in YARN mode, or will it only work in standalone mode? Please confirm.
Yes. Both modes are supported.
From CDH 6.x onwards, standalone mode is no longer supported.
Spark on YARN supports both client and cluster deploy modes (see the sketch below). I have used both of them and they work fine on a Cloudera 5.11 + Spark 2 setup.
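For illustration, a minimal sketch of both deploy modes, assuming the Cloudera Spark 2 distribution's spark2-submit wrapper is on the PATH; the example jar path is an assumption and may differ on your install.
# client mode: the driver runs on the submitting machine
spark2-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-*.jar 100
# cluster mode: the driver runs inside a YARN container
spark2-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-*.jar 100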

Can PySpark work without Spark?

I have installed PySpark standalone/locally (on Windows) using
pip install pyspark
I was a bit surprised I can already run pyspark on the command line or use it in Jupyter Notebooks, and that it does not need a proper Spark installation (e.g. I did not have to do most of the steps in this tutorial: https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c).
Most of the tutorials that I run into say one needs to "install Spark before installing PySpark". That would agree with my view of PySpark being basically a wrapper over Spark. But maybe I am wrong here - can someone explain:
what is the exact connection between these two technologies?
why is installing PySpark enough to make it run? Does it actually install Spark under the hood? If yes, where?
if you install only PySpark, is there something you miss (e.g. I cannot find the sbin folder, which contains, among other things, the script to start the history server)
As of v2.2, executing pip install pyspark will install Spark.
If you're going to use PySpark, it's clearly the simplest way to get started.
On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars
The PySpark installed by pip is a subset of the full Spark distribution: you can find most of the PySpark Python files under spark-3.0.0-bin-hadoop3.2/python/pyspark in the full download. So if you'd like to use the Java or Scala interfaces, or deploy a distributed system with Hadoop, you must download the full Spark distribution from the Apache Spark site and install it.
The pip package bundles a Spark installation. If installed through pip3, you can find it with pip3 show pyspark. E.g. for me it is at ~/.local/lib/python3.8/site-packages/pyspark.
This is a standalone configuration, so it can't be used for managing clusters the way a full Spark installation can.
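To see where the pip-installed Spark lives on your own machine, a couple of one-liners (the paths they print are examples and will differ per system):
# show the package location reported by pip
pip3 show pyspark
# print the directory of the bundled Spark (contains bin/, jars/, etc.)
python3 -c "import pyspark, os; print(os.path.dirname(pyspark.__file__))"
# list the bundled Spark jars
ls "$(python3 -c 'import pyspark, os; print(os.path.dirname(pyspark.__file__))')/jars" | head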

How to upgrade Apache Spark version

Currently, I have installed Spark 1.5.0 on AWS using the spark-ec2.sh script.
Now I want to upgrade my Spark version to 1.5.1. How do I do this? Is there an upgrade procedure, or do I have to build it from scratch using the spark-ec2 script? In that case I will lose all my existing configuration.
Please advise.
Thanks
1.5.1 has identical configuration fields to 1.5.0. I am not aware of any automation tools, but the upgrade should be trivial: copying $SPARK_HOME/conf over should suffice. Back up the old files nevertheless; a sketch is below.
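A minimal sketch of that copy, assuming the old and new installs sit side by side at the hypothetical paths /opt/spark-1.5.0 and /opt/spark-1.5.1:
# back up the old configuration first
cp -r /opt/spark-1.5.0/conf /opt/spark-1.5.0/conf.bak
# carry the configuration over to the new version
cp /opt/spark-1.5.0/conf/* /opt/spark-1.5.1/conf/
# point SPARK_HOME at the new install
export SPARK_HOME=/opt/spark-1.5.1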

How to connect Zeppelin to Spark 1.5 built from the sources?

I pulled the latest source from the Spark repository and built locally. It works great from an interactive shell like spark-shell or spark-sql.
Now I want to connect Zeppelin to my Spark 1.5, according to this install manual. I published the custom Spark build to the local maven repository and set the custom Spark version in the Zeppelin build command. The build process finished successfully but when I try to run basic things like sc inside notebook, it throws:
akka.ConfigurationException: Akka JAR version [2.3.11] does not match the provided config version [2.3.4]
Version 2.3.4 is set in pom.xml and spark/pom.xml, but simply changing them won’t even let me get a build.
If I rebuild Zeppelin with the standard -Dspark.version=1.4.1, everything works.
Update 2016-01
Spark 1.6 support has landed in master and is available under the -Pspark-1.6 profile.
Update 2015-09
Spark 1.5 support has landed in master and is available under the -Pspark-1.5 profile.
Work on supporting Spark 1.5 in Apache Zeppelin (incubating) was done under the PR apache/incubator-zeppelin#269, which will land in master soon.
For now, building from the Spark_1.5 branch with -Pspark-1.5 should do the trick; a build sketch is below.
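For reference, a hedged sketch of such a build, assuming Maven, the incubator-era repository layout, and a locally published Spark 1.5.0-SNAPSHOT (the version string is an assumption; adjust to whatever you published to your local Maven repository):
git clone https://github.com/apache/incubator-zeppelin.git
cd incubator-zeppelin
# build Zeppelin against the locally published custom Spark
mvn clean package -Pspark-1.5 -Dspark.version=1.5.0-SNAPSHOT -DskipTests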
