spark-submit overriding default application.conf not working - apache-spark

I am building a jar which has application.conf under the src/main/resources folder. I am trying to override that file while doing spark-submit, but it's not working.
The following is my command:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--files application.conf \
$sandbox_jar flagFile/test.FLG \
--conf "spark.executor.extraClassPath=-Dconfig.file=application.conf"
application.conf is located in the same directory as my jar file.

-Dconfig.file=path/to/config-file may not work due to the internal cache in ConfigFactory. The documentation suggests running ConfigFactory.invalidateCaches().
The other way is the following, which merges the supplied properties with the existing ones:
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

ConfigFactory.invalidateCaches()
val c = ConfigFactory.parseFile(new File(pathToFile + "/" + "application.conf")) // pathToFile: directory holding the external config
val config: Config = c.withFallback(ConfigFactory.load()).resolve()
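A value can then be read from the merged config in the usual way; input.path here is just a hypothetical key defined in the external file:
val inputPath = config.getString("input.path") // the external file's value wins over the packaged application.conf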
I think the best way to override the properties is to supply them using -D. Typesafe Config gives the highest priority to system properties, so -D will override both reference.conf and application.conf.
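To illustrate the priority order, here is a minimal, standalone sketch (my.app.inputPath is a hypothetical key; with spark-submit, such -D flags are normally passed through spark.driver.extraJavaOptions / spark.executor.extraJavaOptions rather than extraClassPath):

import com.typesafe.config.ConfigFactory

object ConfigPriorityDemo {
  def main(args: Array[String]): Unit = {
    // Simulates passing -Dmy.app.inputPath=/override/path on the JVM command line
    System.setProperty("my.app.inputPath", "/override/path")
    ConfigFactory.invalidateCaches()              // drop any cached config so the new property is seen
    val config = ConfigFactory.load()             // system properties > application.conf > reference.conf
    println(config.getString("my.app.inputPath")) // prints /override/path
  }
}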

Assuming application.conf is a properties file, there is another option that can serve the same purpose.
Packaging the properties file inside the jar may not give you much flexibility; keeping it separate from the jar does, because whenever a property changes you just replace the properties file instead of rebuilding and redeploying the whole jar.
This can be achieved by keeping your properties in a properties file and prefixing each property key with "spark.", for example:
spark.inputpath /input/path
spark.outputpath /output/path
The spark-submit command would then look like:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--properties-file application.conf \
$sandbox_jar flagFile/test.FLG
Get the properties in code like this:
sc.getConf.get("spark.inputpath") // /input/path
sc.getConf.get("spark.outputpath") // /output/path
This will not necessarily solve your problem, but it is another approach you can try.
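For completeness, a minimal sketch of how a job might pick those values up (the class name and the read/write step are only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical job that reads the paths supplied via --properties-file
object MyJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MyJob"))
    val inputPath  = sc.getConf.get("spark.inputpath")              // "/input/path"
    val outputPath = sc.getConf.get("spark.outputpath", "/tmp/out") // with an optional default
    sc.textFile(inputPath).saveAsTextFile(outputPath)
    sc.stop()
  }
}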

Related

Unable to read file in spark program from local directory

I am unable to read a local CSV file in my Spark program. I am using the PyCharm IDE. I am able to read the file via the positional argument, but not with the file location. Can someone please help?
Code:
# Processing logic here...
flightTimeCsvDF = spark.read \
.format("csv") \
.option("header", "true") \
.load("data/flight*.csv")
# .load(sys.argv[1])
Error:
Exception in thread "globPath-ForkJoinPool-1-worker-1" java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1218)
at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1423)
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)
[screenshot of the project layout attached]
Please use the absolute path. From the image attached, I believe using the following will help solve the issue.
.load("C:\\Users\\psultania\\Anaconda3\\envs\\04-SparkSchemaDemo\\data\\flight*.csv")
If you are using different directories for input CSVs, please change the directory definition accordingly.
Yes, it works using the absolute path.

How do I specify output log file during spark submit

I know how to provide the logger properties file to Spark. My logger properties file looks something like this:
log4j.rootCategory=INFO,FILE
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/tmp/outfile.log
log4j.appender.FILE.MaxFileSize=1000MB
log4j.appender.FILE.MaxBackupIndex=2
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
log4j.appender.rolling.strategy.type = DefaultRolloverStrategy
And then I provide the logger properties file path to spark-submit via:
-Dlog4j.configuration=file:logger_file_path
However, I wanted to provide log4j.appender.FILE.File value during spark-submit. Is there a way I can do that?
With regards to justification for the above approach, I am doing spark-submit on multiple YARN queues. Since the Spark code base is the same, I would just want a different log file for spark submit on different queues.
In the log4j properties file, you can use expressions like this:
log4j.appender.FILE.File=${LOGGER_OUTPUT_FILE}
When parsed, the value for log4j.appender.FILE.File will be picked from the system property LOGGER_OUTPUT_FILE.
As per this SO post, you can set the value for the system property by adding -DLOGGER_OUTPUT_FILE=/tmp/outfile.log when invoking the JVM.
So using spark-submit you may try this (I haven't tested it):
spark-submit --master yarn \
--files /path/to/my-custom-log4j.properties \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=my-custom-log4j.properties -DLOGGER_OUTPUT_FILE=/tmp/outfile.log" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=my-custom-log4j.properties -DLOGGER_OUTPUT_FILE=/tmp/outfile.log"

Can spark explode a jar of jar's

My Spark job is invoked like this:
spark-submit --jars test1.jar,test2.jar \
--class org.mytest.Students \
--num-executors ${executors} \
--master yarn \
--deploy-mode cluster \
--queue ${mapreduce.job.queuename} \
--driver-memory ${driverMemory} \
--conf spark.executor.memory=${sparkExecutorMemory} \
--conf spark.rdd.compress=true \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=100" \
${SPARK_JAR} "${INPUT}" "${OUTPUT_PATH}"
Is it possible to pass a single jar which contains test1.jar and test2.jar, like --jars mainTest.jar (where mainTest.jar contains test1.jar and test2.jar)?
My question is basically: can Spark explode a jar of jars? I am using version 1.3.
Question: Can Spark explode a jar of jars?
Yes...
As T. Gaweda suggested, we can achieve this with the Maven Assembly Plugin.
I thought I would put some other options here as well.
Option 1: There is another Maven way, which I find better than the Maven Assembly Plugin: the Apache Maven Shade Plugin.
(It is particularly useful because it merges the content of specific files instead of overwriting them. This is needed when resource files with the same name exist across the jars and the plugin has to package all of them.)
This plugin provides the capability to package the artifact in an
uber-jar, including its dependencies and to shade - i.e. rename - the
packages of some of the dependencies
Goals Overview
The Shade Plugin has a single goal:
shade:shade is bound to the package phase and is used to create a
shaded jar.
Option 2: the sbt way, if you are using sbt.
Source: creating-uber-jar-for-spark-project-using-sbt-assembly:
sbt-assembly is an sbt plugin to create a fat JAR of an sbt project with all of its dependencies.
Add the sbt-assembly plugin in project/plugin.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.9.1")
Specify sbt-assembly.git as a dependency in project/project/build.scala
import sbt._

object Plugins extends Build {
  lazy val root = Project("root", file(".")) dependsOn (
    uri("git://github.com/sbt/sbt-assembly.git#0.9.1")
  )
}
In the build.sbt file, add the following contents:
import AssemblyKeys._ // put this at the top of the file, leave the next line blank
assemblySettings
Use full keys to configure the assembly plugin. For more details, refer to the plugin's documentation on the following keys:
target, assembly-jar-name, test, assembly-option, main-class, full-classpath, dependency-classpath, assembly-excluded-files, assembly-excluded-jars
If multiple files share the same relative path, the default strategy is to verify that all candidates have the same contents and to error out otherwise. This behaviour can be configured for Spark projects using assembly-merge-strategy as follows:
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("javax", "servlet", xs @ _*)        => MergeStrategy.last
    case PathList("org", "apache", xs @ _*)           => MergeStrategy.last
    case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
    case "about.html"                                 => MergeStrategy.rename
    case x                                            => old(x)
  }
}
From the root folder, run:
sbt/sbt assembly
The assembly plugin then packs the class files and all the dependencies into a single JAR file.
You can simply merge those jars into one shaded Jar. Please read this question: How can I create an executable JAR with dependencies using Maven?
You will have all classes in exactly one Jar. There will be no problem with nested Jars.

Which directory contains third party libraries for Spark

When we use
spark-submit
which directory contains third party libraries that will be loaded on each of the slaves? I would like to scp one or more libraries to each of the slaves instead of shipping the contents in the application uber-jar.
Note: I did try adding to
$SPARK_HOME/lib_managed/jars
But the spark-submit still results in a ClassNotFoundException for classes included in the added library.
I hope these points will help you.
$SPARK_HOME/lib/ [contains the jar files]
$SPARK_HOME/bin/ [contains the launch scripts - spark-submit, spark-class, pyspark, compute-classpath.sh, etc.]
spark-submit will call spark-class.
spark-class internally calls compute-classpath.sh before executing/launching the job.
compute-classpath.sh adds the jars available in $SPARK_HOME/lib to the CLASSPATH.
(Running ./compute-classpath.sh returns the jars in the lib dir.)
So try these options.
Option 1 - Placing user-specific jars in $SPARK_HOME/lib/ will work.
Option 2 - Tweak compute-classpath.sh so that it picks up your jars from a user-specific jars directory.

how to find HADOOP_HOME path on Linux?

I am trying to compile the Java code below on a Hadoop server:
javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
but I am not able to locate ${HADOOP_HOME}. I tried hadoop -classpath, but it gives the output below:
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-0.20-mapreduce/./:/usr/lib/hadoop-0.20-mapreduce/lib/*:/usr/lib/hadoop-0.20-mapreduce/.//*
Anyone has any idea about this?
Navigate to the path where Hadoop is installed and locate ${HADOOP_HOME}/etc/hadoop, e.g.:
/usr/lib/hadoop-2.2.0/etc/hadoop
When you run ls in this folder, you should see all of these files:
capacity-scheduler.xml httpfs-site.xml
configuration.xsl log4j.properties
container-executor.cfg mapred-env.cmd
core-site.xml mapred-env.sh
core-site.xml~ mapred-queues.xml.template
hadoop-env.cmd mapred-site.xml
hadoop-env.sh mapred-site.xml~
hadoop-env.sh~ mapred-site.xml.template
hadoop-metrics2.properties slaves
hadoop-metrics.properties ssl-client.xml.example
hadoop-policy.xml ssl-server.xml.example
hdfs-site.xml yarn-env.cmd
hdfs-site.xml~ yarn-env.sh
httpfs-env.sh yarn-site.xml
httpfs-log4j.properties yarn-site.xml~
httpfs-signature.secret
Core configuration settings are available in hadoop-env.sh.
You can see the classpath settings in this file; I copied a sample here for your reference.
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
# The jsvc implementation to use. Jsvc is required to run secure datanodes.
#export JSVC_HOME=${JSVC_HOME}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR}
# Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH+$HADOOP_CLASSPATH:}$f
done
Hope this helps!
The hadoop-core jar file is in the ${HADOOP_HOME}/share/hadoop/common directory, not directly in ${HADOOP_HOME}.
You can set the environment variable in your .bashrc file.
vim ~/.bashrc
Then add the following line to the end of the .bashrc file:
export HADOOP_HOME=/your/hadoop/installation/directory
Just replace the path with your hadoop installation path.
