Configuring CDH cluster with Python 3 - python-3.x

We are using the CDH 5.8.3 community version and want to add support for Python 3.5+ to our cluster.
I know that Cloudera and Anaconda provide a parcel to support Python, but that parcel only supports Python 2.7.
What is the recommended way to enable Python 3+ on a CDH cluster?

Related

How to install Apache Spark on Windows 10?

I am trying to install spark-3.3.1-bin-hadoop3 ("Prebuilt for Apache Hadoop 3.3 and later"), which I downloaded from https://spark.apache.org/downloads.html, on my Windows 10 machine.
But when I search for winutils, the latest version I can find is 3.3.1, from here: https://github.com/kontext-tech/winutils. There are also other winutils builds with lower versions on other GitHub pages.
The latest version of Hadoop is 3.3.4, so I don't know what to do. Should I install Hadoop before installing Spark? And which GitHub page is the official one for winutils?

Install Spark 3.* on HDP with Ambari

We need to install Spark 3.* on HDP 3.1.5, but I can't find any instructions.
I found these instructions:
https://community.cloudera.com/t5/Community-Articles/Steps-to-install-supplementary-Spark-on-HDP-cluster/ta-p/244199
Do they work for Spark 3?
How do I add this service to Ambari?
I need help.

Problem to Install pyspark of version 2.3

I have been trying to install pyspark 2.3 for the last couple of days, but so far I have only found versions 3.0.1 and 2.4.7. I need to run code written for pyspark 2.3 as part of my project. Is that version still available? If it is, please point me to the resources needed to install it, since porting the code to version 3.0.1 seems difficult.
Pyspark 2.3 should still be available via Conda-Forge.
Please check out https://anaconda.org/conda-forge/pyspark/files?version=2.3.2
There you will find the following packages, among others, for direct download:
linux-64/pyspark-2.3.2-py36_1000.tar.bz2
win-64/pyspark-2.3.2-py36_1000.tar.bz2
If you don't want the raw packages, you can also install it via conda:
conda install -c conda-forge pyspark=2.3.2
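After installing, it may help to confirm which pyspark version the active environment actually picks up. The following is only a minimal sketch: the helper names installed_pyspark_version and is_2_3_line are made up for illustration, and it assumes pyspark exposes its version as pyspark.__version__.

```python
def installed_pyspark_version():
    """Return the version string of the importable pyspark, or None if absent."""
    try:
        import pyspark  # imported lazily so the check also works without pyspark
    except ImportError:
        return None
    return pyspark.__version__


def is_2_3_line(version):
    """True if a version string such as '2.3.2' belongs to the 2.3.x line."""
    return version is not None and version.split(".")[:2] == ["2", "3"]


if __name__ == "__main__":
    found = installed_pyspark_version()
    if found is None:
        print("pyspark is not importable in this environment")
    elif is_2_3_line(found):
        print("pyspark %s is on the 2.3.x line" % found)
    else:
        print("pyspark %s is installed, not 2.3.x" % found)
```

If this reports a different version than expected, the notebook or shell is probably using a different environment than the one conda installed into.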

How to change python version in apache toree pyspark notebook?

I am running Apache Toree for a PySpark notebook. I have Anaconda 3.5 and JupyterHub installed on Unix machines. When I invoke pyspark from a Jupyter notebook, it starts with Python 2.7 instead of Anaconda 3.5.
I would appreciate help changing the Python version.
Note that I have already tried changing the Python version via os.environ, but it didn't work.
I followed the steps below to configure Toree with Python 3:
Installed a new kernel with the Spark home and Python path.
jupyter toree install --spark_home="spark_path" --kernel_name=tanveer_kernel1 --interpreters=PySpark,SQL --python="python_path"
After doing the above, there were issues with the driver and executor Python versions. I corrected the Python version in spark-env.sh by adding:
export PYSPARK_PYTHON="/usr/lib/anaconda3/bin/python"
export PYSPARK_DRIVER_PYTHON="/usr/lib/anaconda3/bin/python"
Restarted the Spark services.
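To check whether the steps above took effect, a small diagnostic can be run in a notebook cell. This is a rough sketch rather than part of the original steps: describe_python_env is a hypothetical helper, and it only reports what the running kernel sees; it does not change any Spark setting.

```python
import os
import sys


def describe_python_env(environ=os.environ):
    """Collect the interpreter details that matter when debugging PySpark."""
    return {
        # Interpreter running this kernel (the driver side in a notebook).
        "driver_executable": sys.executable,
        "driver_version": "%d.%d" % sys.version_info[:2],
        # Variables set in spark-env.sh, if they are visible to the kernel.
        "PYSPARK_PYTHON": environ.get("PYSPARK_PYTHON"),
        "PYSPARK_DRIVER_PYTHON": environ.get("PYSPARK_DRIVER_PYTHON"),
    }


if __name__ == "__main__":
    for name, value in describe_python_env().items():
        print("%s = %s" % (name, value))
```

If driver_executable still points at a Python 2.7 interpreter, the kernel itself was started with the wrong Python, and the kernel definition is the place to look rather than spark-env.sh.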

installing spark 1.4.1 in CDH 5.4.2

I am very new to Spark and want to install the latest version on my VM. Can anyone guide me on how to install Spark 1.4.1 on my Cloudera VM, version 5.4.2? I currently have Spark 1.3.0 installed (the default that ships with CDH 5.4.2).
Thank you.
Officially, you will need to wait for Cloudera to release (and support) the newer version of Spark with CDH.
If you need a newer version of Spark before then, you can download Spark yourself and install it alongside CDH.
http://spark.apache.org/downloads.html
You can still use the other CDH Hadoop systems (e.g. HDFS, Hive, etc) from a separate Spark installation.
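One way to point a separately downloaded Spark at the existing cluster is through environment variables. The sketch below is an illustration under assumptions, not a tested CDH recipe: spark_env is a hypothetical helper, /opt/spark-1.4.1 is a made-up install location, and /etc/hadoop/conf is assumed to be where the CDH client configuration lives on your VM.

```python
import os


def spark_env(spark_home, hadoop_conf_dir, base_env=None):
    """Build an environment that puts a separate Spark install first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HOME"] = spark_home
    # Spark reads the Hadoop client configuration from this directory.
    env["HADOOP_CONF_DIR"] = hadoop_conf_dir
    # Prepend the new install's bin directory so its scripts win over the CDH ones.
    env["PATH"] = os.path.join(spark_home, "bin") + os.pathsep + env.get("PATH", "")
    return env


if __name__ == "__main__":
    # Both paths are assumptions; adjust them to your own layout.
    env = spark_env("/opt/spark-1.4.1", "/etc/hadoop/conf")
    print(env["SPARK_HOME"], env["HADOOP_CONF_DIR"])
```

The returned dictionary can be passed as env= to subprocess.run(["spark-submit", "--version"], env=env) to confirm which Spark build gets picked up.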
