add file to spark driver classpath on dataproc - apache-spark

I need to add a config file to the Spark driver classpath on Google Dataproc.
I have tried the --files option of gcloud dataproc jobs submit spark, but this does not work.
Is there a way to do it on Google Dataproc?

In Dataproc, anything listed under --jars will be added to the classpath, and anything listed under --files will be made available in each Spark executor's working directory. Even though the flag is --jars, it should be safe to put non-jar entries in this list if you require the file to be on the classpath.
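For example, a minimal sketch of this approach (the cluster name, region, bucket, and class below are placeholders, not taken from the question):
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=com.example.Main \
    --jars=gs://my-bucket/my-app.jar,gs://my-bucket/app.conf
Here app.conf is not a jar, but listing it under --jars should still place it on the driver and executor classpath, per the behaviour described above.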

I know I am answering late; posting for new visitors.
You can do this from Cloud Shell. I have tested it.
gcloud dataproc jobs submit spark --properties spark.dynamicAllocation.enabled=false --cluster=<cluster_name> --class com.test.PropertiesFileAccess --region=<CLUSTER_REGION> --files gs://<BUCKET>/prod.predleads.properties --jars gs://<BUCKET>/snowflake-common-3.1.34.jar

Related

Submitting application on Spark Cluster using spark submit

I am new to Spark.
I want to run a Spark Structured Streaming application on a cluster.
The master and workers have the same configuration.
I have a few queries about submitting an app on the cluster using spark-submit:
You may find them comical or strange.
How can I give the path for 3rd-party jars like lib/*? (The application has 30+ jars.)
Will Spark automatically distribute the application and required jars to the workers?
Do I need to host the application on all the workers?
How can I know the status of my application, since I am working on the console?
I am using the following script for spark-submit:
spark-submit \
  --class <class-name> \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --supervise \
  --conf spark.driver.extraClassPath=<jar1,jar2..jarn> \
  --executor-memory 4G \
  --total-executor-cores 8 \
  <running-jar-file>
But the code is not running as expected.
Am I missing something?
To pass multiple jar files to spark-submit, you can set the following attributes in the file SPARK_HOME_PATH/conf/spark-defaults.conf (create it if it does not exist).
Don't forget to use * at the end of the paths:
spark.driver.extraClassPath /fullpath/to/jar/folder/*
spark.executor.extraClassPath /fullpathto/jar/folder/*
Spark picks up the attributes from spark-defaults.conf when you use the spark-submit command.
Copy your jar files into that directory, and when you submit your Spark app to the cluster, the jar files in the specified paths will be loaded as well.
spark.driver.extraClassPath: Extra classpath entries to prepend
to the classpath of the driver. Note: In client mode, this config
must not be set through the SparkConf directly in your application,
because the driver JVM has already started at that point. Instead,
please set this through the --driver-class-path command line option or
in your default properties file.
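For reference, the same entries can also be passed per submission on the command line instead of spark-defaults.conf; a minimal sketch reusing the placeholder paths from above:
spark-submit \
  --class <class-name> \
  --master spark://master:7077 \
  --driver-class-path "/fullpath/to/jar/folder/*" \
  --conf "spark.executor.extraClassPath=/fullpath/to/jar/folder/*" \
  <running-jar-file>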
--jars will transfer your jar files to the worker nodes and make them available on both the driver's and executors' classpaths.
Please refer to the link below for more details.
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
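To address the lib/* part of the question: --jars expects a comma-separated list rather than a glob, so one common shell trick (a sketch, assuming the jars sit in a local lib/ directory and their names contain no spaces) is to build the list on the fly:
spark-submit \
  --class <class-name> \
  --master spark://master:7077 \
  --jars "$(echo lib/*.jar | tr ' ' ',')" \
  <running-jar-file>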
You can also make a fat jar containing all dependencies. The link below helps you understand that.
https://community.hortonworks.com/articles/43886/creating-fat-jars-for-spark-kafka-streaming-using.html

Running Spark Job on Zeppelin

I have written a custom Spark library in Scala. I am able to run it successfully as a spark-submit step by spawning the cluster and running the following commands. Here I first fetch my two jars with:
aws s3 cp s3://jars/RedshiftJDBC42-1.2.10.1009.jar .
aws s3 cp s3://jars/CustomJar .
and then I run my Spark job as:
spark-submit --deploy-mode client --jars RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0 --class com.activities.CustomObject CustomJar.jar
This runs my CustomObject successfully. I want to do the same kind of thing in Zeppelin, but I do not know how to add the jars and then run a spark-submit step.
You can add these dependencies to the Spark interpreter within Zeppelin:
Go to "Interpreter"
Choose edit and add the jar file
Restart the interpreter
More info here
EDIT
You might also want to use the %dep paragraph in order to access the z variable (an implicit Zeppelin context) and do something like this:
%dep
z.load("/some_absolute_path/myjar.jar")
It depends on how you run Spark. Most of the time, the Zeppelin interpreter will embed the Spark driver.
The solution is to configure the Zeppelin interpreter instead:
ZEPPELIN_INTP_JAVA_OPTS will configure java options
SPARK_SUBMIT_OPTIONS will configure spark options
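For example, in conf/zeppelin-env.sh you could add something along these lines (the jar path and packages simply mirror the question's spark-submit command and are only illustrative):
export SPARK_SUBMIT_OPTIONS="--jars /path/to/RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0"
export ZEPPELIN_INTP_JAVA_OPTS="-Dsome.property=value"
Restart the Spark interpreter after changing these settings.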

adding external property file to classpath in spark

I am currently submitting my fat jar to the Spark cluster using the command below.
The application fat jar and related configuration are in the folder /home/myapplication:
$SPARK_HOME/bin/spark-submit --jars $SPARK_HOME/lib/protobuf-java-2.5.0.jar --class MainClass /home/myapplication/my-application-fat.jar -appconf /home/myapplication/application-prop.properties -conf /home/myapplication/application-configuration.conf
Now my requirement is to add an external property file /home/myapplication/external-prop.properties to classpath of both driver and worker node.
I searched a lot of resources but could not find the right solution I am looking for!
Please help me resolve the issue. Thanks in advance.
Your requirement calls for using the spark.executor.extraClassPath configuration to point at the properties file. But before that, as philantrovert has pointed out, use the --files option to copy the property file to the worker nodes.
So your correct command should be:
$SPARK_HOME/bin/spark-submit --jars $SPARK_HOME/lib/protobuf-java-2.5.0.jar --files /home/myapplication/external-prop.properties --conf "spark.executor.extraClassPath=./" --class MainClass /home/myapplication/my-application-fat.jar -appconf /home/myapplication/application-prop.properties -conf /home/myapplication/application-configuration.conf
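If the file also has to be on the driver classpath, the driver side needs its own setting; a sketch assuming client deploy mode, where the driver can read the file from its original local directory:
$SPARK_HOME/bin/spark-submit \
  --jars $SPARK_HOME/lib/protobuf-java-2.5.0.jar \
  --files /home/myapplication/external-prop.properties \
  --driver-class-path /home/myapplication \
  --conf "spark.executor.extraClassPath=./" \
  --class MainClass /home/myapplication/my-application-fat.jar \
  -appconf /home/myapplication/application-prop.properties \
  -conf /home/myapplication/application-configuration.conf
In cluster deploy mode the driver also receives the --files copy in its working directory, so spark.driver.extraClassPath=./ would be the analogous setting there.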

can't add alluxio.security.login.username to spark-submit

I have a spark driver program which I'm trying to set the alluxio user for.
I read this post: How to pass -D parameter or environment variable to Spark job? and although helpful, none of the methods in there seem to do the trick.
My environment:
- Spark-2.2
- Alluxio-1.4
- packaged jar passed to spark-submit
The spark-submit job is being run as root (under supervisor), and alluxio only recognizes this user.
Here's where I've tried adding "-Dalluxio.security.login.username=alluxio":
spark.driver.extraJavaOptions in spark-defaults.conf
on the command line for spark-submit (using --conf)
within the sparkservices conf file of my jar application
within a new file called "alluxio-site.properties" in my jar application
None of these set the user for Alluxio, though I'm easily able to set this property in a different (non-Spark) client application that is also writing to Alluxio.
Anyone able to make this setting apply in spark-submit jobs?
If spark-submit is in client mode, you should use --driver-java-options instead of --conf spark.driver.extraJavaOptions=... in order for the driver JVM to be started with the desired options. Therefore your command would look something like:
./bin/spark-submit ... --driver-java-options "-Dalluxio.security.login.username=alluxio" ...
This should start the driver with the desired Java options.
If the Spark executors also need the option, you can set that with:
--conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio"
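Putting both together, a full command might look roughly like this (the class and jar names are placeholders):
./bin/spark-submit \
  --class com.example.MyApp \
  --driver-java-options "-Dalluxio.security.login.username=alluxio" \
  --conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio" \
  my-app.jar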

What is the use of --driver-class-path in the spark command?

As per spark docs,
To get started you will need to include the JDBC driver for your particular database on the spark classpath. For example, to connect to postgres from the Spark Shell you would run the following command:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
The job works fine without --driver-class-path. So what is the use of --driver-class-path in the spark command?
--driver-class-path or spark.driver.extraClassPath can be used to modify the classpath only for the Spark driver. This is useful for libraries that are not required by the executors (for example, any code that is used only locally).
Compared to that, --jars or spark.jars will not only add jars to both the driver and executor classpath, but also distribute the archives over the cluster. If a particular jar is used only by the driver, this is unnecessary overhead.
Let's say we run the following command with Spark 3.3.0:
spark-submit --driver-class-path DCP.jar --jars JARS.jar MAIN.jar
What the scripts will actually execute is:
java \
  -cp DCP.jar:spark/conf:spark/jars/* \
  org.apache.spark.deploy.SparkSubmit \
  --conf spark.driver.extraClassPath=DCP.jar \
  --jars JARS.jar \
  MAIN.jar
(I've removed the irrelevant bits.)
The surprise (for me) is that only DCP.jar is on the classpath. Neither JARS.jar nor MAIN.jar are on the JVM classpath. This means any JDBC driver registration from those jars will not be activated. You need to put the JDBC jar on --driver-class-path.
But you also want the workers to be able to do JDBC. So you need to put the JDBC jar on --jars too. Both are required, like the documentation says.
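So, for the JDBC example from the docs, a submit command would carry the driver jar in both places, roughly like this (MAIN.jar stands in for your application jar):
spark-submit \
  --driver-class-path postgresql-9.4.1207.jar \
  --jars postgresql-9.4.1207.jar \
  MAIN.jar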
