Override Spark version in dependency - apache-spark

In this presentation they show an example of "upgrading" the Spark version just by passing a newer spark-assembly.jar as a dependency. Here's the relevant snippet ("upgrading" from Spark 0.9 to 1.1):
export SPARK_JAR=/usr/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.2.0.jar
java -cp /etc/hadoop/conf:AppJar.jar:spark-assembly.jar org.apache.spark.deploy.yarn.Client --jar AppJar.jar --addJars /jars/config.jar --class ooyala.app.MainClass --arg arg1 --arg arg2 --name MyApp
This is a very nice possibility, since it allows using the latest features without having to upgrade the whole cluster very often. However, the code above is totally outdated now, so I tried something similar with spark-submit (trying to add the jar of Spark 1.5 to a cluster running Spark 1.2):
~/spark-1.5/bin/spark-submit \
--jars ~/spark-assembly-1.5.1-hadoop2.4.0.jar \
--class ooyala.app.MainClass \
--master yarn-client \
ooyala-test_2.10-1.0.jar
But it doesn't work either, resulting in a NullPointerException deep in Spark internals.
Does anyone have experience doing this trick on recent versions of Spark?
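For reference, on Spark 1.x the YARN backend also lets you point at a specific assembly via the spark.yarn.jar property, which replaced the old SPARK_JAR environment variable. A hedged sketch of that variant (the HDFS path is hypothetical, and this is untested against a 1.2 cluster):
# Sketch: point YARN at an explicit assembly instead of the cluster default
~/spark-1.5/bin/spark-submit \
--conf spark.yarn.jar=hdfs:///user/me/spark-assembly-1.5.1-hadoop2.4.0.jar \
--class ooyala.app.MainClass \
--master yarn-client \
ooyala-test_2.10-1.0.jar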

Related

Running spark application with two different java versions

Is there any way I can run a Spark 3 application with two different Java versions, for example Java 8 and Java 11?
I have tried the following command, but it is picking the default Java version, i.e. Java 8.
spark-shell \
--conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.17.0.8-2.el7_9.x86_64/ \
--conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.17.0.8-2.el7_9.x86_64/
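One thing worth noting, as an assumption rather than a verified fix: in client mode the spark-shell driver JVM is launched by the local shell, so the two conf entries above only affect the Application Master and the executors. Exporting JAVA_HOME before launching should switch the driver too (a minimal sketch):
# Sketch: also point the launching shell (and thus the driver JVM) at Java 11
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.17.0.8-2.el7_9.x86_64/
spark-shell \
--conf spark.yarn.appMasterEnv.JAVA_HOME=$JAVA_HOME \
--conf spark.executorEnv.JAVA_HOME=$JAVA_HOME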

Is there a way to use PySpark with Hadoop 2.8+?

I would like to run a PySpark job locally, using a specific version of Hadoop (let's say hadoop-aws 2.8.5) because of some features.
PySpark versions seem to be aligned with Spark versions.
Here I use PySpark 2.4.5 which seems to wrap a Spark 2.4.5.
When submitting my PySpark job using spark-submit --master local[4] ..., with the option --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5, I encounter the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o32.sql
With the following java exceptions:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
Or:
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
I suppose that the Hadoop version used by the PySpark job is not aligned with the one I pass via the spark-submit option spark.jars.packages.
But I have no idea how I could make it work. :)
The default Spark distro has the Hadoop libraries included, and Spark uses its own (system) libraries first. So you should either set --conf spark.driver.userClassPathFirst=true (and, for cluster mode, also --conf spark.executor.userClassPathFirst=true), or download a Spark distro without Hadoop. You will probably have to put your Hadoop distro's jars into the Spark distro's jars directory.
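A minimal sketch of the first suggestion, combining it with the spark.jars.packages option from the question (my_job.py is a placeholder):
# Sketch: prefer user-supplied classes over Spark's bundled Hadoop
spark-submit --master local[4] \
--conf spark.jars.packages=org.apache.hadoop:hadoop-aws:2.8.5 \
--conf spark.driver.userClassPathFirst=true \
--conf spark.executor.userClassPathFirst=true \
my_job.py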
Ok, I found a solution:
1 - Install Hadoop in the expected version (2.8.5 for me)
2 - Install a Hadoop-free version of Spark (2.4.4 for me)
3 - Set the SPARK_DIST_CLASSPATH environment variable, to make Spark use the custom version of Hadoop.
(cf. https://spark.apache.org/docs/2.4.4/hadoop-provided.html)
4 - Add the PySpark directories to PYTHONPATH environment variable, like the following:
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
(Note that the py4j version may differ.)
That's it. A combined sketch of steps 3 and 4 follows below.
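As a minimal sketch, assuming the hadoop command from step 1 is on the PATH (this is the mechanism the hadoop-provided page linked above relies on):
# Step 3: let the Hadoop-free Spark build pick up the separately installed Hadoop jars
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# Step 4: make the bundled PySpark and py4j importable (py4j version may differ)
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH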

spark-submit passing python files in zip does not work

I am trying to submit a Python Application using spark-submit, like so:
spark-submit \
--conf spark.submit.pyFiles=path/to/archive.zip \
--conf spark.app.name=Test123 \
--conf spark.master=local[2] \
--conf spark.driver.memory=5G \
path/to/python_app.py
The python_app.py tries to import modules from archive.zip, but it fails with a ModuleNotFoundError. If I substitute
--conf spark.submit.pyFiles=path/to/archive.zip
with
--py-files path/to/archive.zip
it works as expected. This is really weird, because setting the master, driver memory, and app name via --conf works fine.
What am I missing here? Thanks!
Edit 2018-07-06:
I tried this with Spark versions 2.1.3, 2.2.0 and 2.3.1 - the problem is the same for all three versions. And: I have the problem regardless of submitting to local[x] or yarn.
I had the same problem recently, and I believe the naming might be misleading here.
Setting spark.submit.pyFiles states only that you want to add the files to the PYTHONPATH. Apart from that, you also need to upload those files to the working directory of all your executors. You can do that with spark.files.
For me that does the job. I am setting those values in spark-defaults.conf, as shown below.
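A minimal sketch of such a spark-defaults.conf, reusing the archive path from the question:
# spark-defaults.conf
# put the zip on the PYTHONPATH of driver and executors...
spark.submit.pyFiles   path/to/archive.zip
# ...and also ship it to each executor's working directory
spark.files            path/to/archive.zip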

Is it possible to broadcast Tensorflow libraries using spark-submit -package

I am using a cluster which is not managed by myself. The TensorFlow libraries are not installed on any cluster node, but I would like to run some Spark programs that use the tensorflow package. I am not sure whether it is possible to simply use spark-submit --packages to broadcast the TensorFlow packages across the cluster nodes.
I am not sure about TensorFlow itself, but you can pass local jars using --jars and files using --files to the job. Below is an example:
spark-submit --master yarn-cluster --num-executors 5 --driver-memory 640m --executor-memory 640m --conf spark.yarn.maxAppAttempts=1000 \
--jars /usr/hdp/current/spark-client-1.6.1/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client-1.6.1/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client-1.6.1/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/hdp/current/spark-client-1.6.1/conf/hive-site.xml \
--class com.foobar.main
This is an example of how I start a Spark Streaming job where the Application Master and Driver run on a cluster that does not have Spark installed, so I need to pass along some jars and configs for it to run.
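Since tensorflow is a Python package with native libraries rather than a jar, --packages (which resolves Maven coordinates) is unlikely to help here. One common pattern, sketched below untested and with hypothetical names (tf_env.tar.gz is assumed to be a packed virtualenv or conda environment containing tensorflow), is to ship a whole Python environment as an archive:
# Sketch: distribute a pre-built Python env and point PySpark at it
spark-submit --master yarn --deploy-mode cluster \
--archives tf_env.tar.gz#environment \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
my_tf_job.py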

Error: Unrecognized option: --packages

I'm porting an existing script from BigInsights to Spark on Bluemix. I'm trying to run the following against Spark on Bluemix:
./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster \
--master https://x.x.x.x:8443 --jars ./truststore.jar \
--packages org.elasticsearch:elasticsearch-spark_2.10:2.3.0 \
./export_to_elasticsearch.py ...
However, I get the following error:
Error: Unrecognized option: --packages
How can I pass the --packages parameter?
Bluemix uses a customized Spark version, with a customized spark-submit.sh script that only supports a subset of the original script's parameters. You can see all the configuration properties and parameters you can use in its documentation.
Additionally, you can download the Bluemix version of the script from this link, and there you can see that there is no --packages argument.
Therefore, the problem with your approach is that the Bluemix version of spark-submit does not accept the --packages parameter, probably for security reasons. As an alternative, you can download the jar for the package you want (and maybe a fat jar with the dependencies) and upload them using the --jars parameter. Note: to avoid having to upload the jar files every time you call spark-submit, you can pre-upload them using curl. The details of this procedure can be found at this link.
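A hedged sketch of that --jars alternative, reusing the original command (the elasticsearch jar name is assumed to match the downloaded artifact; the trailing ... stands for the application's own arguments, as in the question):
./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster \
--master https://x.x.x.x:8443 \
--jars ./truststore.jar,./elasticsearch-spark_2.10-2.3.0.jar \
./export_to_elasticsearch.py ...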
Adding to Daniel's post: while using the method to pre-upload your package, you might want to upload it to "${cluster_master_url}/tenant/data/libs", since the Spark service sets the four properties "spark.driver.extraClassPath", "spark.driver.extraLibraryPath", "spark.executor.extraClassPath", and "spark.executor.extraLibraryPath" to ./data/libs/*.
Reference: https://console.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic3.html#spark-submit_properties
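A sketch of the pre-upload itself, untested and with assumed credentials (the exact endpoint and authentication are described in the docs linked above; tenant_id and tenant_secret would come from vcap.json):
# Assumption: PUT to /tenant/data/libs with tenant credentials, per the linked docs
curl -X PUT -k -u "${tenant_id}:${tenant_secret}" \
--upload-file elasticsearch-spark_2.10-2.3.0.jar \
"${cluster_master_url}/tenant/data/libs/elasticsearch-spark_2.10-2.3.0.jar"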
