Is there a way to use PySpark with Hadoop 2.8+? - apache-spark

I would like to run a PySpark job locally, using a specific version of Hadoop (let's say hadoop-aws 2.8.5) because of some features.
PySpark versions seem to be aligned with Spark versions.
Here I use PySpark 2.4.5, which seems to wrap Spark 2.4.5.
When submitting my PySpark job with spark-submit --master local[4] ..., using the option --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5, I encounter the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql
With the following java exceptions:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
Or:
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
I suppose that the PySpark job's Hadoop version is not aligned with the one I pass via the spark-submit option spark.jars.packages.
But I have no idea how I could make it work. :)

The default Spark distro has Hadoop libraries included. Spark uses its own (system) libraries first, so you should either set --conf spark.driver.userClassPathFirst=true (and, for a cluster, also --conf spark.executor.userClassPathFirst=true), or download a Spark distro without Hadoop. In that case you will probably have to put your Hadoop distro's jars into the Spark distro's jars directory.
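As a hedged sketch of the first option (the script name my_job.py is just a placeholder; the package version is the one from the question):

# Prefer the user-supplied hadoop-aws jars over the Hadoop classes bundled with Spark
spark-submit \
  --master local[4] \
  --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5 \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  my_job.py

Note that the userClassPathFirst settings are marked experimental in the Spark docs and can themselves cause class conflicts, which is why the Hadoop-free distro route is often the more robust choice.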

Ok, I found a solution:
1 - Install Hadoop in the expected version (2.8.5 for me)
2 - Install a Hadoop-free build of Spark (2.4.4 for me)
3 - Set the SPARK_DIST_CLASSPATH environment variable so that Spark uses the custom version of Hadoop (see the sketch below).
(cf. https://spark.apache.org/docs/2.4.4/hadoop-provided.html)
4 - Add the PySpark directories to the PYTHONPATH environment variable, like the following:
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
(Note that the py4j version may differ.)
That's it.
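For reference, a rough shell sketch of steps 3 and 4 (all paths are illustrative; the hadoop classpath trick is the one from the hadoop-provided docs linked above):

# Step 3: point Spark at the separately installed Hadoop 2.8.5 (example paths)
export HADOOP_HOME=/opt/hadoop-2.8.5
export SPARK_DIST_CLASSPATH=$("$HADOOP_HOME/bin/hadoop" classpath)

# Step 4: make the bundled PySpark and py4j importable (py4j version may differ)
export SPARK_HOME=/opt/spark-2.4.4-bin-without-hadoop
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$PYTHONPATH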

Related

How to initialise PySpark on AWS Cloud9

I want to initialise PySpark version 3.3.1 on AWS Cloud9 and read an S3 file path from AWS, but when I run the code I get the error shown in the attached image.
I was thinking that there is something wrong with my PySpark initialisation, and I have tried the code below, provided by my colleague, but apparently it doesn't work for me.
My PySpark version is 3.3.1 and my Hadoop version is 3.
pkg_list=org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hadoop:hadoop-aws:2.7.1
pyspark --packages $pkg_list --driver-memory 32G --driver-cores 8 --num-executors 8 --executor-memory 32G --executor-cores 8 --driver-java-options="-Djava.io.tmpdir=/home/yoongkiat/tempfiles"
The error says that some Hadoop config file or option that Spark is using contains the string 64M where only a number is expected.
The error doesn't say which file, and that's not a value you've provided on the command line, so you'll need to debug the installation on your own. As mentioned in comments, AWS EMR already offers a functional Spark environment.
By the way, you cannot use dependencies from different Spark versions; you're running 3.3.1 but trying to add spark-avro built for 2.4.4. I'm also not certain you need to add hadoop-aws, since Spark should have those libraries included out of the box.
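If you do stick with --packages, a hedged sketch with the versions aligned to Spark 3.3.1 (Scala 2.12) could look like the following; the hadoop-aws version assumes the Hadoop 3.3.x line that the Spark 3.3.1 binaries are usually built against, so adjust it to your actual Hadoop build:

# Package coordinates aligned with Spark 3.3.1 / Scala 2.12 (hadoop-aws version is an assumption)
pkg_list="org.apache.spark:spark-avro_2.12:3.3.1,org.apache.hadoop:hadoop-aws:3.3.2"
pyspark --packages $pkg_list --driver-memory 32G --driver-cores 8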

where is local hadoop folder in pyspark (mac)

I have installed PySpark on my local Mac using Homebrew. I can see Spark under /usr/local/Cellar/apache-spark/3.2.1/,
but I am not able to see a hadoop folder. If I run pyspark in the terminal, it starts the Spark shell.
Where can I see its path?
I am trying to connect S3 to PySpark, and I have the dependency jars.
You do not need to know the location of Hadoop to do this.
You should use a command like spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 app.py instead, which will pull in all necessary dependencies rather than making you download all the JARs (and their dependencies) locally.
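A minimal hedged sketch (the credentials are placeholders; hadoop-aws 3.3.1 matches the Hadoop line that the Spark 3.2.1 binaries typically ship with):

# Credentials picked up by the S3A connector's default provider chain (values are placeholders)
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

# Pull hadoop-aws and its transitive AWS SDK dependency at submit time
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 app.py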

Spark on kubernetes with zeppelin

I am following this guide to run a Zeppelin container in a local Kubernetes cluster set up using minikube:
https://zeppelin.apache.org/docs/0.9.0-SNAPSHOT/quickstart/kubernetes.html
I am able to set up Zeppelin and run some sample code there. I have downloaded the Spark 2.4.5 and 2.4.0 source code and built it for Kubernetes support with the following command:
./build/mvn -Pkubernetes -DskipTests clean package
Once Spark was built, I created a Docker image as explained in the article:
bin/docker-image-tool.sh -m -t 2.4.X build
I configured Zeppelin to use the Spark image that was built with Kubernetes support. The article above explains that the Spark interpreter will auto-configure Spark on Kubernetes to run in client mode and run the job.
But whenever I try to run any paragraph with Spark, I receive the following error:
Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: ?/.ivy2/local
I tried setting the spark configuration spark.jars.ivy in zeppelin to point to a temp directory but that does not work either.
I found a similar issue here:
basedir must be absolute: ?/.ivy2/local
But I can't seem to configure Spark to run with the spark.jars.ivy=/tmp/.ivy config. I tried adding it to spark-defaults.conf when building Spark, but that does not seem to work either.
I'm quite stumped by this problem; any guidance on how to solve it would be appreciated.
Thanks!
I have also run into this problem, but a workaround I used for setting spark.jars.ivy=/tmp/.ivy is to set it as an environment variable instead.
In your Spark interpreter settings, add the following property: SPARK_SUBMIT_OPTIONS, and set its value to --conf spark.jars.ivy=/tmp/.ivy.
This passes additional options to spark-submit, and your job should continue.
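Equivalently, if you'd rather not use the interpreter UI, the same setting can go into Zeppelin's conf/zeppelin-env.sh (a sketch, assuming a default Zeppelin layout):

# conf/zeppelin-env.sh: give Ivy an absolute, writable cache directory
export SPARK_SUBMIT_OPTIONS="--conf spark.jars.ivy=/tmp/.ivy"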

Use Apache Zeppelin with existing Spark Cluster

I want to install Zeppelin to use my existing Spark cluster. My setup is the following:
Spark Master (Spark 1.5.0 for Hadoop 2.4):
Zeppelin 0.5.5
Spark Slave
I downloaded Zeppelin v0.5.5 and installed it via:
mvn clean package -Pspark-1.5 -Dspark.version=1.5.0 -Dhadoop.version=2.4.0 -Phadoop-2.4 -DskipTests
I saw that the local[*] master setting also works without my Spark cluster (the notebook is also runnable when the Spark cluster is shut down).
My problem: When I want to use my Spark cluster for a streaming application, it does not seem to work correctly. My SQL table is empty when I use spark://my_server:7077 as the master; in local mode everything works fine!
See also my other question which describes the problem: Apache Zeppelin & Spark Streaming: Twitter Example only works local
Did I do something wrong
on installation via "mvn clean package"?
on setting the master URL?
with the Spark and/or Hadoop version (any limitations)?
Do I have to set something special in the zeppelin-env.sh file (it is currently back on defaults)?
The problem was caused by a missing library dependency! So before searching around too long, first check the dependencies to see whether one is missing! In my case it was the Twitter streaming dependency:
%dep
z.reset
z.load("org.apache.spark:spark-streaming-twitter_2.10:1.5.1")

Automatically including jars to PySpark classpath

I'm trying to automatically include jars to my PySpark classpath. Right now I can type the following command and it works:
$ pyspark --jars /path/to/my.jar
I'd like to have that jar included by default so that I can just type pyspark and also use it in IPython Notebook.
I've read that I can include the argument by setting PYSPARK_SUBMIT_ARGS in env:
export PYSPARK_SUBMIT_ARGS="--jars /path/to/my.jar"
Unfortunately the above doesn't work. I get the runtime error Failed to load class for data source.
Running Spark 1.3.1.
Edit
My workaround when using IPython Notebook is the following:
$ IPYTHON_OPTS="notebook" pyspark --jars /path/to/my.jar
You can add the jar files in the spark-defaults.conf file (located in the conf folder of your Spark installation). If there is more than one entry in the jars list, use : as the separator.
spark.driver.extraClassPath /path/to/my.jar
This property is documented in https://spark.apache.org/docs/1.3.1/configuration.html#runtime-environment
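For example, with two jars on the driver classpath (paths are illustrative):

spark.driver.extraClassPath /path/to/my.jar:/path/to/another.jar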
As far as I know, you have to add the jars to both the driver AND the executor. So you need to edit conf/spark-defaults.conf, adding both lines below.
spark.driver.extraClassPath /path/to/my.jar
spark.executor.extraClassPath /path/to/my.jar
When I went through this, I did not need any other parameters. I guess you will not need them either.
The recommended way since Spark 2.0+ is to use spark.driver.extraLibraryPath and spark.executor.extraLibraryPath:
https://spark.apache.org/docs/2.4.3/configuration.html#runtime-environment
P.S. spark.driver.extraClassPath and spark.executor.extraClassPath are still there, but deprecated and will be removed in a future Spark release.
