Replace default application.conf file in spark-submit

My code works like:
val config = ConfigFactory.load
It gets the key-value pairs from application.conf by default. Then I use -Dconfig.file= to point to another conf file.
It works fine for the command below:
dse -u cassandra -p cassandra spark-submit
--class packagename.classname --driver-java-options
-Dconfig.file=/home/userconfig.conf /home/user-jar-with-dependencies.jar
But now I need to split userconfig.conf into 2 files. I tried the command below, but it doesn't work.
dse -u cassandra -p cassandra spark-submit
--class packagename.classname --driver-java-options
-Dconfig.file=/home/userconfig.conf,env.conf
/home/user-jar-with-dependencies.jar

By default Spark will look in spark-defaults.conf, but you can 1) specify another file using --properties-file, 2) pass individual key-value properties using --conf, or 3) set up the configuration programmatically in your code using the SparkConf object (a sketch of the first two follows below).
Does this help, or are you looking for the Akka application.conf file?
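A rough sketch of the first two options, reusing the command shape from the question (the properties file path and the spark.executor.memory value are just placeholders):
# 1) point spark-submit at a different properties file
dse -u cassandra -p cassandra spark-submit \
  --properties-file /home/user/spark-overrides.conf \
  --class packagename.classname /home/user-jar-with-dependencies.jar
# 2) pass individual key-value properties with --conf
dse -u cassandra -p cassandra spark-submit \
  --conf spark.executor.memory=2g \
  --class packagename.classname /home/user-jar-with-dependencies.jar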

Related

Passing multiple typesafe config files to a yarn cluster mode application

I'm struggling a bit trying to use multiple Typesafe config files (via include) in my Spark application that I am submitting to a YARN queue in cluster mode. I basically have two config files, and the file layouts are provided below:
env-common.properties
application-txn.conf (this file uses an "include" to reference the above one)
Both of the above files are external to my application.jar, so I pass them to YARN using "--files" (as can be seen below).
I am using the Typesafe config library to parse my main conf file ("application-txn.conf"), and in this main conf I am trying to use a property from the env.properties file via substitution, but the variable name does not get resolved :( and I'm not sure why.
env.properties
txn.hdfs.fs.home=hdfs://dev/1234/data
application-txn.conf:
# application-txn.conf
include required(file("env.properties"))
app {
raw-data-location = "${txn.hdfs.fs.home}/input/txn-raw"
}
Spark Application Code:
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.sql.SparkSession

// propFile in the block below maps to "application-txn.conf", passed in from the app's main method
def main(args: Array[String]): Unit = {
  val config = loadConf("application-txn.conf")
  val spark = SparkSession.builder.getOrCreate()
  // Code fails here:
  val inputDF = spark.read.parquet(config.getString("app.raw-data-location"))
}

def loadConf(propFile: String): Config = {
  // load() reads the default application.conf; the result is not used here
  ConfigFactory.load()
  // parse the conf file shipped alongside the jar and resolve its substitutions
  val cnf = ConfigFactory.parseResources(propFile)
  cnf.resolve()
}
Spark Submit Code (called from a shell script):
spark-submit --class com.nic.cage.app.Transaction \
--master yarn \
--queue QUEUE_1 \
--deploy-mode cluster \
--name MyTestApp \
--files application-txn.conf,env.properties \
--jars #Typesafe config 1.3.3 and my app.jar go here \
--executor-memory 2g \
--executor-cores 2 \
app.jar application-txn.conf
When I run the above, I am able to parse the config file, but my app fails when trying to read the files from HDFS because it cannot find a directory with the name:
${txn.hdfs.fs.home}/input/txn-raw
I believe that the config is actually able to read both files...or else it would fail because of the "required" keyword. I verified this by adding another include statement with a dummy file name, and the application failed while parsing the config. Really not sure what's going on right now :(.
Any ideas what could be causing this resolution to fail?
If it helps: When I run locally with multiple config files, the resolution works fine
The syntax in application-txn.conf is wrong.
The variable should be outside the string, like so:
raw-data-location = ${txn.hdfs.fs.home}"/input/txn-raw"
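For reference, the corrected application-txn.conf then becomes (same contents as in the question, only the substitution moved outside the quotes):
# application-txn.conf (corrected)
include required(file("env.properties"))
app {
  raw-data-location = ${txn.hdfs.fs.home}"/input/txn-raw"
}
With env.properties defining txn.hdfs.fs.home=hdfs://dev/1234/data, this resolves to hdfs://dev/1234/data/input/txn-raw.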

Spark read csv file submitted from --files

I'm submitting a Spark job to a remote Spark cluster on YARN and including a file via spark-submit --files. I want to read the submitted file as a DataFrame, but I'm confused about how to go about this without having to put the file in HDFS:
spark-submit \
--class com.Employee \
--master yarn \
--files /User/employee.csv \
--jars SomeJar.jar
val spark: SparkSession = // create the Spark session
val df = spark.read.csv("/User/employee.csv")
spark.sparkContext.addFile("file:///your local file path ")
Add the file using addFile so that it is available on your worker nodes, since you want to read a local file in cluster mode.
You may need to make a slight change according to the Scala and Spark versions you are using.
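A minimal sketch of that suggestion, using the path from the question (note that the local path SparkFiles.get returns on the driver can differ from the one on the executors, so on a real cluster you may still end up using the working-directory approach from the next answer):
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// distribute the local file to every node of the job
spark.sparkContext.addFile("file:///User/employee.csv")

// SparkFiles.get resolves the local copy of the distributed file
val df = spark.read.csv("file://" + SparkFiles.get("employee.csv"))
df.show()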
employee.csv is in the working directory of the executor, so just read it as follows:
val df = spark.read.csv("employee.csv")

Spark-shell vs Spark-submit adding jar to classpath issue

I'm able to run the CREATE TEMPORARY FUNCTION testFunc using jar 'myJar.jar' query in hiveContext via spark-shell --jars myJar.jar -i some_script.scala, but I'm not able to run such a command via spark-submit --class com.my.DriverClass --jars myJar.jar target.jar.
Am I doing something wrong?
If you are using the local file system, the jar must be in the same location on all nodes.
So you have 2 options (a sketch of both follows below):
place the jar on all nodes in the same directory, for example /home/spark/my.jar, and use that path in the --jars option.
use a distributed file system like HDFS.
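For example, the two variants would look roughly like this (the HDFS location is only illustrative):
# option 1: jar copied to the same local path on every node
spark-submit --class com.my.DriverClass --jars /home/spark/my.jar target.jar
# option 2: jar hosted on a distributed file system such as HDFS
spark-submit --class com.my.DriverClass --jars hdfs:///libs/myJar.jar target.jar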

How to use --num-executors option with spark-submit?

I am trying to override Spark properties such as num-executors while submitting the application with spark-submit, as below:
spark-submit --class WC.WordCount \
--num-executors 8 \
--executor-cores 5 \
--executor-memory 3584M \
...../<myjar>.jar \
/public/blahblahblah /user/blahblah
However, it's running with the default number of executors, which is 2. But I am able to override the properties if I add
--master yarn
Can someone explain why that is? Interestingly, in my application code I am setting the master to yarn-client:
val conf = new SparkConf()
.setAppName("wordcount")
.setMaster("yarn-client")
.set("spark.ui.port","56487")
val sc = new SparkContext(conf)
Can someone throw some light on how the option --master works?
I am trying to override Spark properties such as num-executors while submitting the application with spark-submit, as below
It will not work (unless you override spark.master in the conf/spark-defaults.conf file or similar, so you don't have to specify it explicitly on the command line).
The reason is that the default Spark master is local[*] and the number of executors is exactly one, i.e. the driver. That's just the local deployment environment. See Master URLs.
As a matter of fact, num-executors is very YARN-dependent as you can see in the help:
$ ./bin/spark-submit --help
...
YARN-only:
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
That explains why it worked when you switched to YARN. It is supposed to work with YARN (regardless of the deploy mode, i.e. client or cluster, which is about the driver alone, not the executors).
You may be wondering why it did not work with the master defined in your code, then. The reason is that it is too late: the master had already been assigned at launch time, when you started the application using spark-submit. That is exactly why you should not specify deployment environment-specific properties in the code, as:
It may not always work (see the case with master)
It requires the code to be recompiled on every configuration change (which makes it a bit unwieldy)
That's why you should always use spark-submit to submit your Spark applications (unless you've got reasons not to, but then you'd know why and could explain it with ease).
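With the values from the question, the submit command that actually honours --num-executors would therefore look like:
spark-submit --class WC.WordCount \
  --master yarn \
  --num-executors 8 \
  --executor-cores 5 \
  --executor-memory 3584M \
  ...../<myjar>.jar \
  /public/blahblahblah /user/blahblah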
If you'd like to run the same application with different masters or different amounts of memory, Spark allows you to do that with an empty SparkConf. Since you are setting properties on SparkConf, those take the highest precedence for the application; check the properties precedence at the end.
Example:
val sc = new SparkContext(new SparkConf())
Then, you can supply configuration values at runtime (here --conf spark.master is the alternative to --master, and --conf spark.executor.memory the alternative to --executor-memory):
./bin/spark-submit \
  --name "My app" \
  --deploy-mode "client" \
  --conf spark.ui.port=56487 \
  --conf spark.master=yarn \
  --conf spark.executor.memory=4g \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --class WC.WordCount \
  /<myjar>.jar \
  /public/blahblahblah \
  /user/blahblah
Properties precedence order (highest first):
Properties set directly on the SparkConf (in the code) take the highest precedence.
Then flags passed to spark-submit or spark-shell, like --master.
Then options in the spark-defaults.conf file.
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf.
A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Source: Dynamically Loading Spark Properties
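As a concrete illustration of that order (the port numbers here are made up for the example):
// set directly on SparkConf in the code: highest precedence
val conf = new SparkConf().setAppName("wordcount").set("spark.ui.port", "56487")
// even if spark-submit also passes --conf spark.ui.port=4040,
// the application keeps 56487: SparkConf beats command-line flags,
// which in turn beat spark-defaults.conf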

can't add alluxio.security.login.username to spark-submit

I have a spark driver program which I'm trying to set the alluxio user for.
I read this post: How to pass -D parameter or environment variable to Spark job? and, although helpful, none of the methods there seem to do the trick.
My environment:
- Spark-2.2
- Alluxio-1.4
- packaged jar passed to spark-submit
The spark-submit job is run as root (under supervisor), and Alluxio only recognizes this user.
Here's where I've tried adding "-Dalluxio.security.login.username=alluxio":
spark.driver.extraJavaOptions in spark-defaults.conf
on the command line for spark-submit (using --conf)
within the sparkservices conf file of my jar application
within a new file called "alluxio-site.properties" in my jar application
None of these works to set the user for Alluxio, though I'm easily able to set this property in a different (non-Spark) client application that also writes to Alluxio.
Anyone able to make this setting apply in spark-submit jobs?
If spark-submit is in client mode, you should use --driver-java-options instead of --conf spark.driver.extraJavaOptions=... in order for the driver JVM to be started with the desired options. Therefore your command would look something like:
./bin/spark-submit ... --driver-java-options "-Dalluxio.security.login.username=alluxio" ...
This should start the driver with the desired Java options.
If the Spark executors also need the option, you can set that with:
--conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio"
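Putting the two together, the relevant part of the submit command would look roughly like this (all other arguments elided):
./bin/spark-submit \
  --driver-java-options "-Dalluxio.security.login.username=alluxio" \
  --conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio" \
  ...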
