How to access external property file in spark-submit job? - apache-spark

I am using Spark 2.4.1 and Java 8.
I am trying to load an external property file while submitting my Spark job with spark-submit.
I am using the Typesafe Config dependency below to load my property file:
<dependency>
    <groupId>com.typesafe</groupId>
    <artifactId>config</artifactId>
    <version>1.3.1</version>
</dependency>
In my code I am using
public static Config loadEnvProperties(String environment) {
    Config appConf = ConfigFactory.load(); // loads "application.properties" from the "resources" folder
    return appConf.getConfig(environment);
}
To externalize this "application.properties" file, I tried the following spark-submit command, as suggested by an expert:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name Extractor \
--jars "/local/apps/jars/*.jar" \
--files /local/apps/log4j.properties \
--files /local/apps/applicationNew.properties \
--class Driver \
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.debug \
--conf spark.driver.extraClassPath=. \
migration-0.0.1.jar sit
I placed "log4j.properties" & "applicationNew.properties" files same folder where I am running my spark-submit.
1) In the above shell script if I keep
--files /local/apps/log4j.properties, /local/apps/applicationNew.properties \
Error :
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/local/apps//applicationNew.properties
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
So what is wrong here?
2) Then I changed the above script as shown, i.e.
--files /local/apps/log4j.properties \
--files /local/apps/applicationNew.properties \
When I run the Spark job, I get the following error:
19/08/02 14:19:09 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'
at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:152)
So what is wrong here? Why is the applicationNew.properties file not being loaded?
3) I debugged it by printing the "config.file" system property:
String ss = System.getProperty("config.file");
logger.error("config.file : {}", ss);
Error :
19/08/02 14:19:09 ERROR Driver: config.file : null
19/08/02 14:19:09 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'
So how do I set the "config.file" option from spark-submit?
How do I fix the above errors and load the properties from an external applicationNew.properties file?

The proper way to list files for --files, --jars and other similar arguments is to separate them with a comma and no spaces (this is crucial, and the exception about not being able to load the main class is caused precisely by this):
--files /local/apps/log4j.properties,/local/apps/applicationNew.properties
If the file names themselves contain spaces, use quotes to escape them:
--files "/some/path with/spaces.properties,/another path with/spaces.properties"
Another issue is that you specify the same property twice:
...
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
...
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
...
There is no way for spark-submit to know how to merge these values, so only one of them is used. This is why you see null for the config.file system property: a later --conf spark.driver.extraJavaOptions argument simply overwrites the earlier one, so only a single -D setting survives. The correct way is to specify all of these values in one property:
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties"
Note that because of quotes, the entire spark.driver.extraJavaOptions="..." is one command line argument rather than several, which is very important for spark-submit to pass these arguments to the driver/executor JVM correctly.
(I also changed the log4j.configuration value to a proper file: URI instead of a bare path. I recall that it might not work unless the path is a URI, but you can try either way and check for sure.)
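Putting the two fixes together (comma-separated --files and a single merged extraJavaOptions), the submit command from the question would look roughly like this; the paths, class and jar names are the ones from the question, so adjust them to your environment:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name Extractor \
--jars "/local/apps/jars/*.jar" \
--files /local/apps/log4j.properties,/local/apps/applicationNew.properties \
--class Driver \
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties -Dlog4j.debug" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties" \
--conf spark.driver.extraClassPath=. \
migration-0.0.1.jar sit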

--files and SparkFiles.get
With --files you should access the resource using SparkFiles.get as follows:
$ ./bin/spark-shell --files README.md
scala> import org.apache.spark._
import org.apache.spark._
scala> SparkFiles.get("README.md")
res0: String = /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/spark-f0b16df1-fba6-4462-b956-fc14ee6c675a/userFiles-eef6d900-cd79-4364-a4a2-dd177b4841d2/README.md
In other words, Spark distributes the --files to the executors, but the only way to know the local path of those files is to use the SparkFiles utility.
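Since the code in the question is Java, a minimal sketch of the same idea with Typesafe Config would look like this (assuming applicationNew.properties was shipped with --files, and "sit" is the environment key from the question):
import java.io.File;
import org.apache.spark.SparkFiles;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Resolve the local path of a file distributed via --files,
// then parse it explicitly instead of relying on -Dconfig.file.
String path = SparkFiles.get("applicationNew.properties");
Config appConf = ConfigFactory.parseFile(new File(path)).resolve();
Config envConf = appConf.getConfig("sit");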
getResourceAsStream(resourceFile) and InputStream
The other option would be to package all resource files into a jar file and bundle it together with the other jar files (either as a single uber-jar or simply as part of CLASSPATH of the Spark app) and use the following trick:
this.getClass.getClassLoader.getResourceAsStream(resourceFile)
With that, regardless of the jar file the resourceFile is in, as long as it's on the CLASSPATH, it should be available to the application.
I'm pretty sure any decent framework or library that uses resource files for configuration, e.g. Typesafe Config, accepts InputStream as the way to read resource files.
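With Typesafe Config specifically, a short sketch of that approach: parseResources reads straight off the CLASSPATH (it infers the syntax from the file extension), and if you already have an InputStream from getResourceAsStream, wrapping it in a Reader and calling ConfigFactory.parseReader works as well. The file name below is just the one from the question.
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Parse a properties/HOCON file that is bundled on the CLASSPATH (e.g. inside one of the jars).
Config conf = ConfigFactory.parseResources("applicationNew.properties").resolve();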
You could also include the --files as part of a jar file that is part of the CLASSPATH of the executors, but that'd be obviously less flexible (as every time you'd like to submit your Spark app with a different file, you'd have to recreate the jar).

Related

Spark external jars and files on hdfs

I have a Spark job that I run using the spark-submit command.
The jar that I use is hosted on HDFS, and I call it directly in the spark-submit command using its HDFS file path.
Following the same logic, I'm trying to do the same for the --jars option, the --files option and also the extraClassPath option (in spark.conf), but there seems to be an issue with the fact that they point to an HDFS file path.
My command looks like this:
spark-submit \
--class Main \
--jars 'hdfs://path/externalLib.jar' \
--files 'hdfs://path/log4j.xml' \
--properties-file './spark.conf' \
'hdfs://path/job_name.jar'
So not only does Spark raise an exception saying it cannot find the method when I call one that comes from externalLib.jar, but right from the start I also get these warning logs:
Source and destination file systems are the same. Not copying externalLib.jar
Source and destination file systems are the same. Not copying log4j.xml
It must come from the fact that I specify an HDFS path, because it works flawlessly when I refer to those jars on the local file system.
Maybe it isn't possible? What can I do?

Path of jars added to a Spark Job - spark-submit

I am using Spark 2.1 (BTW) on a YARN cluster.
I am trying to upload JARs to the YARN cluster and use them to replace the on-site (already in-place) Spark JARs.
I am trying to do so through spark-submit.
The question Add jars to a Spark Job - spark-submit - and the related answers - are full of interesting points.
One helpful answer is the following one:
spark-submit --jars additional1.jar,additional2.jar \
--driver-class-path additional1.jar:additional2.jar \
--conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar
So, I understand the following:
"--jars" is for uploading jar on each node
"--driver-class-path" is for using uploaded jar for the driver.
"--conf spark.executor.extraClassPath" is for using uploaded jar for executors.
While I master the filepaths for "--jars" within a spark-submit command, what will be the filepaths of the uploaded JAR to be used in "--driver-class-path" for example ?
The doc says: "JARs and files are copied to the working directory for each SparkContext on the executor nodes"
Fine, but for the following command, what should I put instead of XXX and YYY ?
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path XXX:YYY \
--conf spark.executor.extraClassPath=XXX:YYY \
--class MyClass main-application.jar
When using spark-submit, how can I reference the "working directory for the SparkContext" to form the XXX and YYY file paths?
Thanks.
PS: I have tried
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path some1.jar:some2.jar \
--conf spark.executor.extraClassPath=some1.jar:some2.jar \
--class MyClass main-application.jar
No success (if I made no mistake)
And I have tried also:
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path ./some1.jar:./some2.jar \
--conf spark.executor.extraClassPath=./some1.jar:./some2.jar \
--class MyClass main-application.jar
No success either.
spark-submit by default uses client mode.
In client mode, you should not use --jars in conjunction with --driver-class-path.
--driver-class-path will overwrite the original classpath, instead of prepending to it as one may expect.
--jars will automatically add the extra jars to the driver and executor classpaths, so you do not need to add their paths manually.
It seems that in cluster mode --driver-class-path is ignored.
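Concretely, under the answer's assumptions (client mode, with --jars doing the classpath work), the commands from the question reduce to a sketch like:
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--class MyClass main-application.jar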

How to pass an external resource yml/property file when running a Spark job on a cluster?

I am using spark-sql 2.4.1, Jackson jars and Java 8.
In my Spark program/job I read a few configurations/properties from an external "conditions.yml" file, which is placed in the "resources" folder of my Java project, as below:
ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
try {
    driverConfig = mapper.readValue(
        Configuration.class.getClassLoader().getResourceAsStream("conditions.yml"), Configuration.class);
} catch (Exception e) {
    throw new RuntimeException("Failed to read conditions.yml", e);
}
If I want to pass "conditions.yml" file from outside while submitting spark-job how to pass this file ? where it should be placed?
In my program I am reading from "resouces" directory i.e. .getResourceAsStream("conditions.yml") ...if i pass this property file from spark-submit ...will the job takes from here from resouces or external path ?
If I want to pass as external file , do I need to change the code above ?
Updated Question:
In my Spark driver program I read the property file from the program arguments,
which is loaded as below:
Config props = ConfigFactory.parseFile(new File(args[0]));
I run my Spark job from a shell script as below:
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name MyDriver \
--jars "/local/jars/*.jar" \
--files hdfs://files/application-cloud-dev.properties,hdfs://files/condition.yml \
--class com.sp.MyDriver \
--executor-cores 3 \
--executor-memory 9g \
--num-executors 5 \
--driver-cores 2 \
--driver-memory 4g \
--driver-java-options -Dconfig.file=./application-cloud-dev.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application-cloud-dev.properties \
--conf spark.driver.extraClassPath=. \
--driver-class-path . \
ca-datamigration-0.0.1.jar application-cloud-dev.properties condition.yml
Error: the properties are not loaded. What is wrong here? What is the correct way to pass program arguments to the Spark job's Java program?
You will have to use --files with the path to your file in the spark-submit command to be able to pass any files. The syntax for that is:
--files /home/user/config/my-file.yml
If the file is on HDFS, then provide the HDFS path.
This should copy the file so that it ends up on the classpath, and your code should be able to find it from the driver.
The implementation for reading the file should look something like this:
import java.util.Properties
import scala.io.Source

def readProperties(propertiesPath: String): Properties = {
  val url = getClass.getResource("/" + propertiesPath)
  assert(url != null, s"Could not create URL to read $propertiesPath properties file")
  val source = Source.fromURL(url)
  val properties = new Properties
  properties.load(source.bufferedReader)
  properties
}
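Since the job in the question is written in Java, a rough Java equivalent of that helper (a sketch, assuming the file shipped with --files ends up on the driver classpath as described above) would be:
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public static Properties readProperties(String propertiesPath) throws IOException {
    // Look the file up on the classpath, just like the Scala snippet above.
    try (InputStream in = Thread.currentThread().getContextClassLoader()
            .getResourceAsStream(propertiesPath)) {
        if (in == null) {
            throw new IOException("Could not find " + propertiesPath + " on the classpath");
        }
        Properties properties = new Properties();
        properties.load(in);
        return properties;
    }
}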
hope that is what you are looking for
You can add:
spec:
args:
--deploy-mode
cluster

Passing multiple typesafe config files to a yarn cluster mode application

I'm struggling a bit trying to use multiple (via include) Typesafe config files in my Spark application, which I submit to a YARN queue in cluster mode. I basically have two config files; their layouts are provided below:
env-common.properties
application-txn.conf (this file uses an "include" to reference the above one)
Both of the above files are external to my application.jar, so I pass them to YARN using "--files" (as can be seen below).
I am using the Typesafe Config library to parse my "application-txn.conf", and in this main conf I am trying to use a property from the env.properties file via substitution, but the variable name does not get resolved, and I'm not sure why.
env.properties
txn.hdfs.fs.home=hdfs://dev/1234/data
application-txn.conf:
# application-txn.conf
include required(file("env.properties"))
app {
raw-data-location = "${txn.hdfs.fs.home}/input/txn-raw"
}
Spark Application Code:
// propFile in the below block maps to "application-txn.conf" from the app's main method
def main(args: Array[String]): Unit = {
  val config = loadConf("application-txn.conf")
  val spark = SparkSession.builder.getOrCreate()
  // Code fails here:
  val inputDF = spark.read.parquet(config.getString("app.raw-data-location"))
}

def loadConf(propFile: String): Config = {
  ConfigFactory.load()
  val cnf = ConfigFactory.parseResources(propFile)
  cnf.resolve()
}
Spark Submit Code (called from a shell script):
spark-submit --class com.nic.cage.app.Transaction \
--master yarn \
--queue QUEUE_1 \
--deploy-mode cluster \
--name MyTestApp \
--files application-txn.conf,env.properties \
--jars #Typesafe config 1.3.3 and my app.jar go here \
--executor-memory 2g \
--executor-cores 2 \
app.jar application-txn.conf
When I run the above, I am able to parse the config file, but my app fails when trying to read the files from HDFS because it cannot find a directory with the name:
${txn.hdfs.fs.home}/input/txn-raw
I believe the config is actually able to read both files, or else it would fail because of the "required" keyword. I verified this by adding another include statement with a dummy file name, and the application failed while parsing the config. I'm really not sure what's going on right now.
Any ideas what could be causing this resolution to fail?
If it helps: When I run locally with multiple config files, the resolution works fine
The syntax in application-txn.conf is wrong.
The variable should be outside the string, like so:
raw-data-location = ${txn.hdfs.fs.home}"/input/txn-raw"
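With that fix applied, application-txn.conf from the question becomes (HOCON concatenates the substitution with the adjacent quoted string):
# application-txn.conf
include required(file("env.properties"))
app {
  raw-data-location = ${txn.hdfs.fs.home}"/input/txn-raw"
}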

Can't override Typesafe configuration on the command line in Spark

I have a Typesafe configuration file application.conf in the src/main/resources folder, which is loaded by default.
A single value can be overridden by specifying:
--conf spark.driver.extraJavaOptions=-DsomeValue="foo"
However, specifying a completely new, i.e. overriding, application.conf file like:
spark-submit \
--class my.Class \
--master "local[2]" \
--files foo.conf \
--conf spark.driver.extraClassPath="-Dconfig.file=file:foo.conf" \
--conf spark.driver.extraJavaOptions=-Dvalue="abcd" \
job.jar
will fail to load foo.conf. Instead, the original file from the resources folder is loaded.
Trying the tricks from Using typesafe config with Spark on Yarn did not help either.
Edit:
Overriding multiple config values in Typesafe config when using an uberjar to deploy seems to be the answer for plain (non-Spark) programs.
The question remains how to bring this to Spark.
Also, passing:
--conf spark.driver.extraClassPath="-Dconfig.resource=file:foo.conf"
--conf spark.driver.extraClassPath="-Dconfig.resource=foo.conf"
fails to load my configuration from the command line.
Though, according to the docs (https://github.com/lightbend/config):
For applications using application.{conf,json,properties}, system properties can be used to force a different config source (e.g. from the command line -Dconfig.file=path/to/config-file):
config.resource specifies a resource name - not a basename, i.e. application.conf not application
config.file specifies a filesystem path, again it should include the extension, not be a basename
config.url specifies a URL
These system properties specify a replacement for application.{conf,json,properties}, not an addition. They only affect apps using the default ConfigFactory.load() configuration. In the replacement config file, you can use include "application" to include the original default config file; after the include statement you could go on to override certain settings.
It should be possible with these parameters.
spark-submit \
--class my.Class \
--master "local[2]" \
--files foo.conf \
--conf spark.driver.extraJavaOptions="-Dvalue='abcd' -Dconfig.file=foo.conf" \
target/scala-2.11/jar-0.1-SNAPSHOT.jar
Changing from spark.driver.extraClassPath to spark.driver.extraJavaOptions does the trick.
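To double-check which config source was actually picked up, a small sketch like this can be dropped into the driver (Java here, but the same calls exist from Scala); with -Dconfig.file=foo.conf in effect, the origin description of the loaded Config should typically mention foo.conf rather than the bundled application.conf:
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Print where the merged configuration came from and the value overridden on the command line.
Config conf = ConfigFactory.load();
System.out.println(conf.origin().description());
System.out.println(conf.getString("value"));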
