can't add alluxio.security.login.username to spark-submit - apache-spark

I have a spark driver program which I'm trying to set the alluxio user for.
I read this post: How to pass -D parameter or environment variable to Spark job? and although helpful, none of the methods in there seem to do the trick.
My environment:
- Spark-2.2
- Alluxio-1.4
- packaged jar passed to spark-submit
The spark-submit job is being run as root (under supervisor), and alluxio only recognizes this user.
Here's where I've tried adding "-Dalluxio.security.login.username=alluxio":
spark.driver.extraJavaOptions in spark-defaults.conf
on the command line for spark-submit (using --conf)
within the sparkservices conf file of my jar application
within a new file called "alluxio-site.properties" in my jar application
None of these set the user for alluxio, though I'm easily able to set this property in a different (non-Spark) client application that also writes to alluxio.
Anyone able to make this setting apply in spark-submit jobs?

If spark-submit is in client mode, you should use --driver-java-options instead of --conf spark.driver.extraJavaOptions=... in order for the driver JVM to be started with the desired options. Therefore your command would look something like:
./bin/spark-submit ... --driver-java-options "-Dalluxio.security.login.username=alluxio" ...
This should start the driver with the desired Java options.
If the Spark executors also need the option, you can set that with:
--conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio"
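Putting the two together, a full invocation might look something like this (the class and jar names are placeholders for your application):
./bin/spark-submit \
  --class com.example.MyApp \
  --driver-java-options "-Dalluxio.security.login.username=alluxio" \
  --conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio" \
  my-app.jar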

Related

spark-shell avoid typing spark.sql(""" query """)

I use spark-shell a lot, often just to run SQL queries against a database. The only way to run them there is by wrapping each query in spark.sql(""" query """).
Is there a way to switch to spark-sql directly and avoid the wrapper code? E.g. when using beeline, we get a direct SQL interface.
The spark-sql CLI ships with the Spark package:
$SPARK_HOME/bin/spark-sql
$ spark-sql
spark-sql> select 1-1;
0
Time taken: 6.368 seconds, Fetched 1 row(s)
spark-sql> select 1=1;
true
Time taken: 0.095 seconds, Fetched 1 row(s)
spark-sql>
Notes:
Spark SQL CLI cannot talk to the Thrift JDBC server
Configuration of Hive is done by placing your hive-site.xml, core-site.xml and hdfs-site.xml files in conf/
spark-sql --help
Usage: ./bin/spark-sql [options] [cli option]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
CLI options:
-d,--define <key=value> Variable subsitution to apply to hive
commands. e.g. -d A=B or --define A=B
--database <databasename> Specify the database to use
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable subsitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the
console)
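Beyond the interactive shell, the same CLI can run queries non-interactively via the -e and -f options listed above; for example (the table name and file path are placeholders):
$SPARK_HOME/bin/spark-sql -e "SELECT count(*) FROM my_table"
$SPARK_HOME/bin/spark-sql -f /path/to/queries.sql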

Submitting application on Spark Cluster using spark submit

I am new to Spark.
I want to run a Spark Structured Streaming application on a cluster.
The master and workers have the same configuration.
I have a few questions about submitting the app to the cluster using spark-submit:
You may find them comical or strange.
How can I give the path for 3rd-party jars like lib/*? (The application has 30+ jars.)
Will Spark automatically distribute the application and the required jars to the workers?
Do I need to host the application on all the workers?
How can I know the status of my application, since I am working from the console?
I am using the following script for spark-submit:
spark-submit
--class <class-name>
--master spark://master:7077
--deploy-mode cluster
--supervise
--conf spark.driver.extraClassPath <jar1, jar2..jarn>
--executor-memory 4G
--total-executor-cores 8
<running-jar-file>
But the code is not running as expected. Am I missing something?
To pass multiple jar files to spark-submit, you can set the following properties in SPARK_HOME_PATH/conf/spark-defaults.conf (create the file if it does not exist):
Don't forget to use * at the end of the paths:
spark.driver.extraClassPath /fullpath/to/jar/folder/*
spark.executor.extraClassPath /fullpath/to/jar/folder/*
Spark picks up the properties in spark-defaults.conf when you run the spark-submit command.
Copy your jar files into that directory, and when you submit your Spark app to the cluster, the jars in the specified paths will be loaded as well.
spark.driver.extraClassPath: Extra classpath entries to prepend
to the classpath of the driver. Note: In client mode, this config
must not be set through the SparkConf directly in your application,
because the driver JVM has already started at that point. Instead,
please set this through the --driver-class-path command line option or
in your default properties file.
--jars will transfer your jar files to the worker nodes and make them available on both the driver's and the executors' classpaths.
Please refer to the link below for more details:
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
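Applied to the command in the question, a minimal sketch using --jars instead of spark.driver.extraClassPath could look like this (the class name and application jar are placeholders; --jars takes a comma-separated list and does not expand wildcards itself):
# build a comma-separated list from lib/*.jar (assumes no spaces in the paths)
spark-submit \
  --class <class-name> \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --supervise \
  --jars "$(echo lib/*.jar | tr ' ' ',')" \
  --executor-memory 4G \
  --total-executor-cores 8 \
  <running-jar-file>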
Alternatively, you can build a fat jar containing all dependencies. The link below explains how:
https://community.hortonworks.com/articles/43886/creating-fat-jars-for-spark-kafka-streaming-using.html

Running Spark Job on Zeppelin

I have written a custom Spark library in Scala. I am able to run it successfully as a spark-submit step by spawning the cluster and running the following commands. Here I first fetch my two jars:
aws s3 cp s3://jars/RedshiftJDBC42-1.2.10.1009.jar .
aws s3 cp s3://jars/CustomJar .
and then I run my Spark job as:
spark-submit --deploy-mode client --jars RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0 --class com.activities.CustomObject CustomJar.jar
This runs my CustomObject successfully. I want to do the same thing in Zeppelin, but I do not know how to add the jars and then run the equivalent of a spark-submit step.
You can add these dependencies to the Spark interpreter within Zeppelin:
Go to "Interpreter"
Choose edit and add the jar file
Restart the interpreter
EDIT
You might also want to use the %dep paragraph to access the z variable (an implicit Zeppelin context) and do something like this:
%dep
z.load("/some_absolute_path/myjar.jar")
It depends on how you run Spark. Most of the time, the Zeppelin interpreter embeds the Spark driver.
The solution is to configure the Zeppelin interpreter instead:
ZEPPELIN_INTP_JAVA_OPTS configures the Java options
SPARK_SUBMIT_OPTIONS configures the Spark options
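These are typically set in Zeppelin's conf/zeppelin-env.sh. As a rough sketch for the jars and packages from the question (the local path to the JDBC jar is an assumption about where you copied it), it could look like:
# conf/zeppelin-env.sh
export SPARK_SUBMIT_OPTIONS="--jars /path/to/RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0"
Restart the Spark interpreter afterwards so the options take effect.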

What is the use of --driver-class-path in the spark command?

As per spark docs,
To get started you will need to include the JDBC driver for your particular database on the Spark classpath. For example, to connect to postgres from the Spark Shell you would run the following command:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
The job works fine without --driver-class-path. So what is the use of --driver-class-path in the spark command?
--driver-class-path or spark.driver.extraClassPath can be used to modify the classpath of the Spark driver only. This is useful for libraries that are not required by the executors (for example, any code that is used only locally).
By comparison, --jars or spark.jars not only adds jars to both the driver and executor classpaths, but also distributes the archives over the cluster. If a particular jar is used only by the driver, that distribution is unnecessary overhead.
Let's say we run the following command with Spark 3.3.0:
spark-submit --driver-class-path DCP.jar --jars JARS.jar MAIN.jar
What the scripts will actually execute is:
java
-cp DCP.jar:spark/conf:spark/jars/*
org.apache.spark.deploy.SparkSubmit
--conf spark.driver.extraClassPath=DCP.jar
--jars JARS.jar
MAIN.jar
(I've removed the irrelevant bits.)
The surprise (for me) is that only DCP.jar is on the classpath. Neither JARS.jar nor MAIN.jar are on the JVM classpath. This means any JDBC driver registration from those jars will not be activated. You need to put the JDBC jar on --driver-class-path.
But you also want the workers to be able to do JDBC. So you need to put the JDBC jar on --jars too. Both are required, like the documentation says.
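Putting it together for the Postgres example from the docs, a command that makes the JDBC driver visible to both the driver and the executors might look like this (the application class and jar names are placeholders):
bin/spark-submit \
  --class com.example.MyJdbcApp \
  --driver-class-path postgresql-9.4.1207.jar \
  --jars postgresql-9.4.1207.jar \
  my-jdbc-app.jar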

How to config spark.io.compression.codec=lzf in Spark

How to config spark.io.compression.codec=lzf in Spark?
Usually, I use spark-submit to run our driver class, like below:
./spark-submit --master spark://testserver:7077 \
  --class com.spark.test.SparkTest \
  --conf "spark.io.compression.codec=lzf" \
  /tmp/test/target/test.jar
So I can set spark.io.compression.codec=lzf in the command. But what if I don't want to use spark-submit to run our driver class, and instead want to run it in spark-jobserver? How do I configure it there? Thanks.
I tried setting it in environment variables, but it doesn't work. I also tried the code below; that still doesn't work.
SparkConf sparkConf = new SparkConf()
    .setMaster("spark://testserver:7077")
    .setAppName("Javasparksqltest")
    .set("spark.executor.memory", "8g")
    .set("spark.io.compression.codec", "lzf");
You can pass that option to spark-submit or spark-shell by putting it in the conf/spark-defaults.conf associated with your installation. The details are in the configuration section of the docs.
For spark-jobserver, you configure the setting on a given context, especially if the context is implicitly created for a job. There are several ways to do so (the gist being that settings are nested under spark.context-settings); the "Context configuration" section of the README details how:
https://github.com/spark-jobserver/spark-jobserver/blob/master/README.md
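As a rough sketch only (the exact file and nesting should be checked against the README above), the codec could be declared in the job server's HOCON configuration under spark.context-settings:
# in the job server's .conf file; layout per the spark-jobserver README
spark {
  context-settings {
    spark.io.compression.codec = lzf
  }
}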
Use the complete class name "org.apache.spark.io.LZFCompressionCodec" instead of "lzf".
