Running spark application with two different java versions - apache-spark

Is there any way I can run the Spark3 application with two different java versions for example java8 and java11?
I have tried the following command but it is picking the default java version i.e java8.
spark-shell \
--conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.17.0.8-2.el7_9.x86_64/ \
--conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.17.0.8-2.el7_9.x86_64/

Related

How to initialise PySpark on AWS Cloud9

I want to initialise pyspark version 3.3.1 on aws cloud9 and to read a s3 file path from AWS. But when I run the code, I got this error shown in the image.
I was thinking that there is something wrong with my Pyspark initilisation, and I have tried the code below provided by my colleague but apparently this doesn't work for me. enter image description here
My pyspark version is 3.3.1 and hadoop version 3
pkg_list=org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hadoop:hadoop-aws:2.7.1
pyspark --packages $pkg_list --driver-memory 32G --driver-cores 8 --num-executors 8 --executor-memory 32G --executor-cores 8 --driver-java-options="-Djava.io.tmpdir=/home/yoongkiat/tempfiles"
The error is saying that in some hadoop config file or option that Spark is using, you have a string 64M, but it's only expecting a number.
The error doesn't say which file, and that's not a value you've provided on the command line, so you'll need to debug the installation on your own. As mentioned in comments, AWS EMR already offers a functional Spark environment.
By the, you cannot use dependencies from different Spark versions; you're running 3.3.1, but trying to add spark-avro for 2.4.4. I'm also not certain you'll need to add hadoop-aws since Spark should have those libraries included out of the box.

spark-submit passing python files in zip does not work

I am trying to submit a Python Application using spark-submit, like so:
spark-submit \
--conf spark.submit.pyFiles=path/to/archive.zip \
--conf spark.app.name=Test123 \
--conf spark.master=local[2] \
--conf spark.driver.memory=5G \
path/to/python_app.py
The python_app.py tries to import modules from archive.zip, but it fails with an ModuleNotFoundError. If I substitute
--conf spark.submit.pyFiles=path/to/archive.zip
with
--py-files path/to/archive.zip
it works as expected. It is really weird because setting master, driver memory and app name works using --conf.
What am I missing here? Thanks!
Edit 2018-07-06:
I tried this with Spark versions 2.1.3, 2.2.0 and 2.3.1 - the problem is the same for all three versions. And: I have the problem regardless of submitting to local[x] or yarn.
I had the same problem recently. I believe naming might be misleading here.
setting spark.submit.pyFiles states only that you want to add them to PYTHONPATH. But apart of that you need to upload those files to all your executors working directory. You can do that with spark.files
For me it does the job. I am setting those values in spark-defauls.conf

Is it possible to broadcast Tensorflow libraries using spark-submit -package

I am using a cluster, which is not managed by myself. Tensorflow libraries are not installed on any cluster nodes. But I would like to run some Spark programs using tensorflow package. I am not sure if it is possible to simply use spark-submit --packages to broadcast tensorflow packages across the cluster nodes.
I am not sure about Tensorflow itself, but you can pass a local jars using --jars and files using --files to the job. Below is an example:
spark-submit --master yarn-cluster --num-executors 5 --driver-memory 640m --executor-memory 640m --conf spark.yarn.maxAppAttempts=1000 \
--jars /usr/hdp/current/spark-client-1.6.1/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client-1.6.1/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client-1.6.1/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/hdp/current/spark-client-1.6.1/conf/hive-site.xml \
--class com.foobar.main
This is an example of how I start spark streaming job and the Application Master and Driver run on the cluster where spark is not installed. So I need to pass a long some jars and configs for it to run.

Error: Unrecognized option: --packages

I'm porting an existing script from BigInsights to Spark on Bluemix. I'm trying to run the following against Spark on Bluemix:
./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster \
--master https://x.x.x.x:8443 --jars ./truststore.jar \
--packages org.elasticsearch:elasticsearch-spark_2.10:2.3.0 \
./export_to_elasticsearch.py ...
However, I get the following error:
Error: Unrecognized option: --packages
How can I pass the --packages parameter?
Bluemix uses a customized Spark version, with a customized spark-submit.sh script that only supports a subset of the original script parameters. You can see all the configuration properties and parameters you can use on its documentation.
Additionally, you can download the Bluemix version of the script from this link, and there you can see that there is no argument --packages.
Therefore, the problem with your approach is that the Bluemix version of spark-submit does not accept the --packages parameter, probably due to security reasons. However, alternatively, you can download the jar for the package you want (and maybe a fat jar for the dependencies) and upload them using the --jars parameter. Note: To avoid the necessity of uploading the jar files each time you call spark-submit, you can pre-upload them using curl. The details of this procedure can be found on this link.
Adding to Daniel's post, while using the method to pre-upload your package, you might want to upload your package to "${cluster_master_url}/tenant/data/libs", since Spark service sets these four spark properties "spark.driver.extraClassPath", "spark.driver.extraLibraryPath", "spark.executor.extraClassPath", and "spark.executor.extraLibraryPath" to ./data/libs/*
Reference: https://console.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic3.html#spark-submit_properties

Overwrite Spark version in depencency

In this presentation they show an example of "upgrading" Spark version just by passing newer spark-assembly.jar as a dependency. Here's relevant snippet ("upgrading" from Spark 0.9 to 1.1):
export SPARK_JAR=/usr/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.2.0.jar
java -cp /etc/hadoop/conf:AppJar.jar:spark-assembly.jar org.apache.spark.deploy.yarn.Client --jar AppJar.jar --addJars /jars/config.jar --class ooyala.app.MainClass --arg arg1 --arg arg2 --name MyApp
This is very nice possibility since it allows to use latest features without the need to upgrade the whole cluster very often. However, code above is totally outdated now, so I tried to use something similar with spark-submit (trying to add jar of Spark 1.5 to a cluster running Spark 1.2):
~/spark-1.5/bin/spark-submit \
--jars ~/spark-assembly-1.5.1-hadoop2.4.0.jar \
--class ooyala.app.MainClass
--master yarn-client
ooyala-test_2.10-1.0.jar
But it doesn't work either, resulting in NullPointerException deep in Spark internals.
Does anyone have experience doing this trick on recent versions of Spark?

Resources